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1 .  Introduction 

The  book  by  Forsythe  and  Moler  [1967]  gives  an  error  analysis 
for  Gaussian  elimination  drawn  from  the  work  of  Wilkinson  and  an  analysis  for 
iterative  improvement  due  to  Moler.   In  this  paper  we  do  a  careful  backward 
error  analysis  using  a  different  idea  of  what  it  means  for  a  perturbation 
to  be  small,  namely  that  each  datum  is  subject  to  a  small  relative  change. 

A  system  of  n  equations 

n 

I   a..x.  =  b . ,  1  <  i  <  n, 

in  n  unknowns  x. ,  1  <  i  <  n,  is  often  written  in  matrix  notation  as 
l    —   — 

Ax  =  b. 

The  solution  x  computed  in  floating  point  arithmetic  is  not  generally 

exactly  equal  to  x,  but  it  still  may  be  acceptable  if  it  satisfies  one  of 

two  criteria: 

(i)   It  fulfills  the  accuracy  requirements  of  the  problem  poser. 
This  usually  involves  some  measure  of  how  far  x  is  from  x. 
For  example,  in  determining  the  coefficients  of  an  inter- 
polating polynomial  it  is  some  norm  of  the  residual 
r  =  A(x  -  x)  which  is  of  concern.   More  often  though  it 
is  some  norm  of  the  error  x  -  x  which  is  of  concern, 
(ii)   It  is  the  exact  solution  of  a  problem  which  differs  from 

the  given  problem  by  less  than  the  uncertainty  in  the  data, 
so  that  it  is  conceivable  that  x  exactly  solves  the  original 
problem.   Uncertainty  in  the  data  is  generally  present  if 
only  because  of  the  roundoff  error  introduced  when  the 
numbers  are  put  into  the  computer. 

It  is  this  second  criterion  which  motivates  backward  error  analysis. 


In  §2  we  give  mathematical  form  to  the  informal  discussion  of 
ill  conditioning  found  in  Hamming's  [1971]  book  Introduction   to  Applied 
Numerical  Analysis.      The  condition  of  a  system  is  the  sensitivity  of  the 
solution  to  uncertainty  in  the  data,  and  this  sensitivity  is  often  measured 
by  a  condition  number.   The  usual  definition  of  the  condition  number  is 
based  on  the  idea  that  the  uncertainties  in  the  data  are  roughly  the  same 
size  in  an  absolute  sense.   However,  if  we  suppose  that  these  uncertainties 
have  roughly  the  same  relative  size,  then  we  obtain 


A_1 1  I A I  I x 


as  the  condition  number  of  a  linear  system  where  ||°||  is  the  max  norm. 

There  are  at  least  a  couple  of  reasons  for  believing  that  this  second 

approach  is  more  realistic.   First,  most  numerical  computations  are  done 

in  floating  point  rather  than  fixed  point  arithmetic,  and  for  floating 

point  computation  the  conversion  of  data  to  machine  represented  to 

numbers  results  in  errors  of  the  same  relative  size.   Second,  measurement 

errors  are  usually  more  nearly  the  same  in  relative  size  than  in  absolute  size. 

In  §3  we  pose  the  question:   What  is  the  least  amount  by  which 
A  and  b  must  be  perturbed  so  that  x  exactly  solves  the  perturbed  problem? 
The  answer  to  this  question  turns  out  to  be 

|  (b  -  Ax)  .  | 
max     '         i ' 


i   (|b|  +  |A| |x|) . 


which  we  call  the  "backward  error."   If  this  is  less  than  the  unit  roundoff 
error  u,  then  the  second  acceptability  criterion  is  satisfied.   And  if  it 
can  be  shown  that  an  algorithm  produces  a  backward  error  that  is  always 
bounded  by  some  fixed  multiple  K(n)u  of  the  unit  roundoff  error,  then  by 
increasing  the  precision  of  the  intermediate  results  by  a  factor  K(n) ,  the 


second  acceptability  criterion  can  be  met.   An  algorithm  with  this  desirable 

property  is  said  to  be  stable.   If  an  algorithm  is  only  stable  for 

infinitesimal  values  of  u,  it  is  said  to  be  asymptotically  stable. 

The  stability  of  Gaussian  elimination  with  row  pivoting  (usually 

called  partial  pivoting)  is  examined  in  §4.   By  using  an  example  of 

Hamming's  it  is  shown  that  row  pivoting  is  not  asymptotically  stable  even 

for  systems  with  equilibrated  matrices.   Then  by  means  of  a  careful  error 

analysis  performed  in  Appendix  A  a  bound  on  the  backward  error  is  obtained 

which  contains  the  quantity 

max  , 1  -1.  I I  *  I N 
(D.  A  x  )  . 
1    '1   '  '  '  l 

min  .  1  -1.  i  i - i x 
•   ( Pi  A  x  )  • 

where  D    is  the  matrix  of  row  scaling  factors.   This  quantity  is  minimized 

by  choosing 

D1  =  diag(|A| |*|), 

which  calls  for  the  i-th  row  to  be  divided  by  la...  x,   +  a._  x„  I  +.  .  .+ 

1  ll   1 '    '  i2   2 ' 

la.   x  |.   It  is  shown  that  with  such  a  choice  for  Dn  row  pivoting  would  be 
1  in  n'  1 

stable.   Of  course  this  is  impractical,  which  explains  Stewart's  [1973,  p.  151 
observation  that  "In  spite  of  intensive  theoretical  investigation,  there  is 
no  satisfactory  algorithm  for  scaling  a  general  matrix."   Nonetheless,  the 
ratio 


max 
i 


(|A||x|) 


mm  ,  I  .  I  I  *  I  v 
•   C  A  x  )  . 
J      "    J 


is  an  excellent  a  posteriori   measure  of  how  poorly  scaled  the  system  is. 

Sometimes,  programming  considerations  (Sherman  [1976])  call  for 
the  use  of  column  pivoting  instead  of  row  pivoting,  where  by  column 
pivoting  we  mean  that  columns  are  interchanged  so  that  each  pivot  is  the 
largest  in  its  row.   In  §5  it  is  shown  that  column  pivoting  could  be 
made  stable  if  it  were  somehow  possible  to  scale  the  columns  with 


the  matrix  of  scale  factors 

D2  =  diag(|x|). 
This  calls  for  each  column  to  be  multiplied  by  its  corresponding  computed 
solution  value.   A  measure  of  ill  scaling  is  given  by 

(|A|e).  ||x|| 
max   '  '   1  "  " 

i     (|A||i|). 

where  e  is  the  vector  of  all  ones. 

Row  pivoting  may  be  regarded  as  the  generalization  of  complete 
pivoting  in  which  the  ordering  of  the  columns  is  arbitrary,  and  similarly 
column  pivoting  as  the  generalization  in  which  the  ordering  of  the  rows  is 
arbitrary.   From  this  observation  it  follows  that  the  results  of  both  §4 
and  §5  apply  to  complete  pivoting.   However,  one  suspects  that  the  error 
of  complete  pivoting  satisfies  an  error  bound  which  is  appreciably  better 
than  simply  the  smaller  of  the  bounds  for  row  and  column  pivoting. 

In  §6  iterative  improvement  is  examined  in  the  hope  that  the  poor 
stability  properties  of  Gaussian  elimination  can  be  corrected.   It  is  shown 
that  a  single  iteration  of  iterative  improvement  performed  in  single 
precision  is  enough  to  make  Gaussian  elimination  asymptotically  stable. 

Before  proceeding,  it  might  be  interesting  to  demonstrate  the 
instability  of  complete  pivoting  with  a  simple  2x2  system  of  equations. 
Consider  Ax  =  b  where 


A  = 


3   3 
-1   0 


and  b  = 


The  coefficient  matrix  A  is  equilibrated  according  to  the  definition  of 
Forsythe  and  Moler  [1967,  p.  45].   Using  rounded  t  digit  decimal  (floating 
point)  arithmetic  the  elimination  step  yields  A'x  =  b'  where 


A'  = 


3      3 
0   .99- • -9 
and  so  the  computed  solution 


and  b*  = 


1 
33---3 


x  = 


.33---3  x  10 
.33-V3 


-t- 


The  backward  error  is  determined  by  considering  perturbed  problems  of 
the  form 


3(1  +  6U)   3(1  +  612) 
-  (1  +  621)      0 


1  +  6 
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and  choosing  the  relative  changes  6..  so  as  the  minimize  the  maximum  <5 .  . 

ij  ij 

In  this  case,  6_   must  be  chosen  to  be  -1,  and  so  the  backward  error  is 


100%  regardless  of  the  precision  t. 


2.  Condition  of  Linear  Systems 

The  condition  of  a  problem  is  the  sensitivity  of  its  solution  to  uncertainties 
in  the  problem  data.   The  importance  of  this  concept  is  that  it  indicates  the 
amount  of  accuracy  that  one  should  reasonably  expect  for  the  solution  of  a 
problem  with  inexact  data.   And  even  for  problems  with  exact  data,  the 
conversion  of  the  numbers  to  the  computer's  floating  point  number  base  usually 
introduces  errors. 

As  the  measure  of  the  condition  of  a  problem  we  take  the  maximum 
amount  by  which  an  infinitesimal  perturbation  in  the  problem  data  can  be 
amplified  in  the  solution.   More  precisely  if  C  denotes  the  given  problem 
data  and  <J>(?)  denotes  the  solution  of  a  problem  with  data  £,  then  we  define 
the  condition  nwnbev   to  be 


lim  relative  distance  from  <{>(£)  to  c|>(£)  /?  -,  \ 

£->-£;     relative  distance  from  £  to  £ 

In  the  case  where  £  and  cj)(£)  are  scalars,  the  condition  number  is  the  absolute 

value  of  the  relative  derivative,  namely 

<KO  i 
(cf.  Bauer  [1974]).   For  linear  algebraic  systems  Ax  =  b  we  have  £  =  (A,  b) 
and  4>(£)  =  A   b.   (In  roundoff  analysis  the  number  of  equations  n  is  not 
considered  to  be  part  of  the  problem  data;  rather  we  take  the  point  of  view 
that  each  value  of  n  defines  a  separate  class  of  problems.)   There  are  two  crucial 
matters  that  have  to  be  settled:   (i)   how  to  define  relative  distance  in 
the  problem  space,  and  (ii)   how  to  define  relative  distance  in  the 
solution  space. 

Any  problem  which  is  to  be  solved  in  an  approximate  sense  is 
incomplete  unless  there  is  also  some  "metric"  specified  for  measuring  how 


good  the  approximation  is.   It  is  this  metric  that  should  be  used  in  defining 
the  "relative  distance  from  <f>(C)  to  <KO."  Often  this  metric  measures  how 
close  the  approximate  solution  is  to  the  true  solution;  in  other  cases  it 
measures  how  well  the  approximate  solution  satisfies  the  problem.   In  this 
section  we  choose  to  consider  the  problem  of  approximating  A   b  rather  than 
that  of  approximately  solving  Ax  =  b.   The  ratio 

where  |  •  j  denotes  the  max  norm  is  an  adequate  measure  of  relative  distance 
for  most  purposes  if  the  unknowns  are  appropriately  scaled. 

The  question  of  how  to  measure  the  "relative  distance  from  \    to  £" 
is  more  difficult  to  answer  because  a  completely  specified  approximation 
problem  need  not  include  a  metric  for  the  problem  space.   However,  there  is 
one  metric  which  is  always  safe  to  use,  namely  the  componentwise  relative 
error 

\i.  -  u 

max   '  1    i1 

If  the  value  of  this  quantity  is  small,  then  |  is  close  to  £  by  any  reasonable 
standard,  especially  in  view  of  the  fact  that  putting  data  into  the  computer 
results  in  small  componentwise  errors.   This  metric  has  another  advantage  in 
that  it  is  always  meaningful  regardless  of  the  physical  dimensions  of  the 
problem  data  and  thus  it  is  independent  of  possibly  arbitrary  choice  of  units 
for  the  data.   For  these  reasons  we  take  as  our  measure  of  relative  distance 
the  smallest  e  _>  0  such  that 

I  a .  .  -  a.. I  <  e|a..|  and  lb.  -b.l  <  elb.l 
This  seems  to  be  consistent  with  the  ideas  expressed  in  Hamming's  [1971]  book 
Introduction  to  Applied  Numerical  Analysis.      On  page  117  it  is  stated  that 


"The  term  'ill-conditioned'  is  ill  defined.   The  vague  idea 
is  that  small  changes  in  the  initial  system  can  produce  large 
changes  in  the  final  result.   If  we  are  to  take  floating 
point  seriously,  then  we  should  say  'relatively  small  changes' 
and  'relatively  large  changes'." 

and  on  page  122  it  is  stated  that 

"...the  system  is  indeed  ill  conditioned  because,  no  matter 
how  we  try,  we  are  unable  to  solve  the  system  so  that  the 
answer  is  not  sensitive  to  small  changes  in  the  original 
coefficients. " 

Thus  it  seems  that  by  "relatively  small  changes  in  the  initial  system" 

Hamming  means  "relatively  small  changes  in  the  coefficients   of  the  initial 

system."  A  similar  thought  is  expressed  by  Kahan  [1966,  p.  795],   It  is 

worth  mentioning  that  this  approach  has  the  advantage  of  forcing  the  perturbed 

matrix  A  to  have  the  same  sparsity  structure  as  the  original  matrix  A,  making 

it  more  plausible  to  regard  A  as  the  result  of  perturbing  the  original  physical 

problem. 

Having  chosen  our  metrics,  we  are  in  a  position  to  determine  the 
condition  number  of  a  system  Ax  =  b .   We  begin  by  obtaining  bounds  on  the 
uncertainty  in  the  solution  due  to  the  uncertainty  in  A  and  b.   Bounds  of 
this  type  also  appear  in  Bauer  [1966].   Our  notation  uses  inequalities 
between  arrays  to  mean  inequality  of  the  corresponding  components.   The 
absolute  value  of  an  array  is  also  to  be  understood  in  a  componentwise  sense. 

THEOREM  2.1.  Let   Ax  =  b  and   (A  +  6A) (x  +  6x)  =  b  +  6b  where 
|  <SA  |  <_  e  |  A  |  and   |  6b  |  <  e  |  b  |  .  Then 

«*ll<  ,  II  lA-1||A||x|  +  lA^llbl  || 


™  "     (l-elllA^llAllDHxH 

provided  that  the  denominator  is  positive. 
PROOF.   We  have  that 


6x  =  A  16A(x  +  fix)  +  A  16b  ,  (2.2) 


and  so 


5x|  £  |A  1||6A|(|x|  +  |6x|)  +  |A  1||6b| 

<  £  I A-1 1  I A I C I x I  +  |6x|)  +  £  | A-1 !  I b I  . 


Therefore 


6x||  <  elllA^llAlM  +  |A-1||b|||+  clllA-MlAlllllfixll  .        Q.E.D. 


Note.   Bauer  [1966]  shows  that  the  bound  of  Theorem  2.1  can  be 
improved  by  replacing  (1  -  e  ||  |A   |  |a|  ||)    by  ||(I  -  e  |A   |  |a|)   ||,  and  so  it 
is  not  necessary  that  e  ||  |A   |  |a|  ||  <  1  but  only  that  the  spectral  radius  of 
e    | A   | |a|  be  less  than  1. 

THEOREM  2.2.  Let  Ax  =  b .  Then  there  exist  6k  and  6b  suah  that 
| 6 A  J  =  e  |a| ,  | 5b |  =  e  |b|,  and  the  solution  x  +  6x  of  (A  +  6 A) (x  +  6x)  = 
b  +  6b  satisfies 

IISxl 


>        HlA^llAllxl    +    lA^Mblll 
"  (l  +  .lllA^HAllWWI 

PROOF.      Let   I  be   such   that 

(|a_1||a||x|   +  | a-1  1 1 b | ) ^  =  ||  |a_1|  |A|  |x|   +  |A_1||b||| 

Define   6A  and  6b  by 

6ajk  =   sgn(a£jXk)    £    |ajk| 


and 


6b  .    =   sgn(a.  .  )    £    |b  . 
3  ^J  J 


where   A        =    (a .  .  ) .      Then 


(A  16Ax  +  A  16b)      -   S    T.   a        6a        x,+  E   a.  .    6b. 

=    £    C | A-1 1  I A I | x |    +    |A_1||b|)£ 
=   £|||A-1||A||x|    +    lA^llblH    , 


but  from  (2.2) 
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(A  16Ax  +  A  ^b)   =  (6x  -  A  16A6x)Ji  , 


and  so 


e  II  I  A  1|  |a|  |x|  +  |  A  1\  lb  I  ||  <  II  6x||  +  £  II  |A  1|  |a|  II  ||6x||  .       Q.E.D. 


THEOREM  2.3.  The  condition  numberi   as  defined  by   (2.1),  of  a 
linear  algebraic  system   Ax  =  b  is 


A  x 


-1 


+  lA^llb 


PROOF.   The  condition  number  of  a  linear  algebraic  system  Ax  =  b  is 


lim 


6x  M  x 


e(6A,6b)  -y   0   e(6A,6b) 

where  e(6A,6b)  =  min{e  _>  0:  J  6  A  j  ;<  e  |  A  |  ,  |  6b  |  <_  e  |  b  |  }  and  6x  satisfies 

(A  +  6A) (x  +  6x)  =  b  +  6b .   Consider  any  sequence  (6A  ,  6b  )  for  which 

m    m 

e  (6 A  ,  6b  )  -*■   0  as  m  -*■   °°.   By  Theorem  2.1  we  have 
mm 


6x 


,      (kk      xu    ^    ||  |A  1|  |A|  |x|  +  |A  1|  |b|  || 

<    e(6A    ,6b    )   LLJ LJ — u — ' ' — - — LJ — L-n 

~  mm,.  ,«.      „,     v   iii.-lii.iiK 


(1   -  e(6A   ,6b    )        A 
m        m 


for   sufficiently   large  m.      Therefore 

1^     HfixmM|x|l        <     HlA^llAllxl    +    lA^MbMI 

m  ^  co   e(6A   ,6b    )      -  ||x|| 

mm  1 1     ii 

which   gives   an  upper  bound   on    the    condition  number.      Let   e      be   a   sequence 
converging   to    zero.      By  Theorem   2.2   there   exists   a   sequence    (6A    ,    6b    )    such 


that    e(6A    ,    6b    )    =   e      and 
mm  m 


ll«*    II 
ii      mii 


-1 


-1 


>      e 


m 


| A]  |x|    +    |a      I  |b|    || 

(i  +  OIIa^iiaIIDMI 


Therefore 


— —       6x    \\/\  x  iiIa_1|IaM    i         I A  —1 1  i ,  i  ti 

lim     "      m"  "     "  ||  |A         |A| |x|    +    |A         |b  | 


m  ->  co 


m 


x 


which  gives  a  lower  bound  on  the  condition  number.   Q.E.D. 
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In  subsequent  sections  of  this  paper,  we  will  consider  the  effects 
of  perturbing  only  the  elements  of  the  coefficient  matrix. 

THEOREM  2.4.  Let   Ax  =  b  and   (A  +  6 A)  (x  +  6x)  =  b  where    |<5A|  <_ 
e | A | .  Then 

fix 


elllA^llAllxMI u 

(l-.HlA-HjAllMWI 
PROOF.   Similar  to  that  of  Theorem  2.1.   Q.E.D. 

THEOREM  2.5.  Let   Ax  =  b.  Then  there  exists   5A  such  that    | 6A |  =  £  |a| 

and  such  that   the  solution   x  +  5x  of   (A  +  6A) (x  +  5x)  =  b  satisfies 

6x 


,  elllA^llAllxlH 

"     (l  +  eHlA^llAllDHxIl 

PROOF.      Similar   to   that   of   Theorem  2.2.      Q.E.D. 

It   follows   from  these   last   two   theorems   that  when  only  A  is  subject 

to  uncertainty  the   condition  number   is 


A 
Cond(A,  x)  =  [UJ1 


ALkLU 


-ii 

x 
Since  ||  |a_1|  |a|  |x|  ||  <  ||  |a_1|  |a|  |x|  +  |A_1|  |b|  ||  £  2||  |a_1|  |a|  |x|  ||, 

Cond(A,  x)  is  also  adequate  for  the  case  where  both  A  and  b  are  subject  to 

uncertainty.   A  similar  quantity 

||A-1||,||Ae(.)||  |x.| 

K(A,  x)  =  ^r 


X 


is  used  by  Van  der  Sluis  [1970a],  which  he  calls  [1970b]  the  "condition 

number  of  the  solution."  Here  e,.x  denotes  the  i-th  unit  vector. 

(j) 

The  condition  number  of  a  matrix  A  could  be  defined  as  the 

T 
maximum  value  of  Cond(A,  x),  which  is  achieved  with  x  =  e  =  (1,  1,...,  1)  . 

Thus 

Cond(A)  =  Cond(A,  e)  =  ||  |a_1  |  | a|  ||. 

This  quantity  is  more  satisfying  as  a  measure  of  ill  condition  than  the 

usual  cond(A)  =  ||a  ||||a||  for  a  couple  of  reasons.   First,  the  matrix 
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| A   | |A|  is  a  mapping  of  the  solution  space  into 

itself,  which  means  that  the  quantity  ||  |A   |  |a|  ||  can  be  defined  entirely  in 

-1 


terms  of  the  solution  space  norm.   whereas  cond(A)  =  ||A     ||A||  is  defined  in 

terms  of  a  solution  space  norm  and  a  residual  space  norm,  which  seems  quite 

unnecessary.   Second,  the  quantity  Cond(A)  is  invariant  under  row  scaling. 

Multiplying  a  system  of  equations  by  a  diagonal  matrix  does  not 

change  the  problem  in  any  fundamental  way.   For  example,  all 

systems  Dx  =  b  where  D  is  diagonal  are  well  conditioned.   Accordingly,  we 

have  that  Cond(D)  =  1;  whereas  cond(D)  can  be  arbitrarily  large. 

Example.      According  to  Hamming  [1971,  p.  120],  the  system  Ax  =  b 
is  well  conditioned  where 


A  = 


3   2   1 
2   2e  2e 
1   2e   -e 


b  = 


3  +  3e 
6e 
2e 


The  inverse  of  the  coefficient  matrix  and  the  solution  are  given  below: 


-.6e    .4 


.2 


C1  =  - 

1-1. 8e 


Hence 


lA"1!  lAl  =    X 


.4   -J.e  1-.3   .2e  1-.6 
.2    .2e_1-.6  -.4e-1-.6 


1+1. 8e    2.4e     1.6e 
-1 


x  = 


1-1. 8e 


4e  +1.2   1.4-.6e 
-1 


8e 


1.6     l-.6e 


and 
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a  1| |a| IxI  +  |a  1|  Ibl  = 


1-1. 8e 


9.6e  +  3.6e' 
4.8  +  2.4e 
6  -  2.4e 


which  shows  that  the  system  is  well  conditioned.   However, 


Cond(A)  = 


,8e  -1  +  2.6  -  .6e 


1-1. 8e 

which  indicates  that  the  system  would  be  ill  conditioned  for  some  different 
right  hand  side  b,  and  in  fact,  Hamming  [1971,  p.  122]  gives  such  an  example, 
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3.  Stability  of  Algorithms  for  Linear  Systems 

Let  +,  -,  *,  /  denote  the  floating  point  operations  corresponding 

to  +,  -,  x,  /.   Every  reference  to  a  floating  point  result  x  o  y  carries  with 

it  the  assumption  that  x,  o,  and  y  are  such  that  the  result  is  well  defined. 

Nothing  is  assumed  about  the  floating  point  arithmetic  except  that  the 

relative  roundoff  error  is  bounded  by  u/(l  +  u)  where  the  unit  roundoff 

error  u  is  a  small  positive  number;  that  is, 

x  6  y  =  (x  o  y)  (1  +  6) 
for  some  6  depending  on  x,  o,  and  y  which  satisfies 

1  '  —  1  +  u 

It  follows  from  the  above  condition  that 

x  O  y 
x  o  y  =  — 

1  +  6' 
where  |6'|  <_  u.   Note  that  for  rounding  u  =  -r-  0    and  for  chopping  u  =  8 
where  B  is  the  base  and  t  is  the  number  of  base  3  digits  in  the  fraction  of 
the  floating  point  numbers. 

For  any  computed  solution  x  we  define  the  relative  backward  error 
to  be  the  smallest  real  number  n,~>.  such  that 

(A  +  6A)(x  -  6x)  =  b  +  6b 
for  some  6 A,  6b,  and  6x  with  1 6 A |  <_  r\(^.\k\,    |6b|  <^  n /^\ |b|  ,  and 
|6x|  <_  ri/^\|x  -  6x|  .   The  backward  error  can  be  interpreted  in  the  following 
way:   The  computed  solution  x  is  the  rounded  solution  of  a  problem  with 
rounded  data  where  H/o\  is  the  maximum  relative  roundoff  error.   Thus,  if 
Hz-n  is  no  larger  than  the  unit  roundoff  error  u,  then  our  solution  is  as 
good  as  our  data  deserve;  otherwise,  improved  accuracy  may  be  justified, 
perhaps  by  using  iterative  improvement.   Further  motivation  for  this  definition 
is  given  in  Miller  [1975]  and  Bauer  [1974].   If  a  solution  cannot  be  computed 
because  the  matrix  is  nearly  singular,  then  the  backward  error  is  defined  to  be 
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the  smallest  real  number  H/o\  such  that  A  +  6A  is  singular  for  some 


6A 


with  |  6A|  <_   H/o\  |A[  . 

Stability  of  the  algorithm  means  that  there  exists  a  stability 
constant  K(n)  and  a  stability  threshold  u(n)  >  0  both  independent  of  the 
problem  data  (A,  b)  such  that  the  relative  backward  error 

^(3%  1  K(n)u 
provided  that  u  £  u(n).   A  weaker  concept  asymptotic  stability   allows  the 
threshold  u(n)  to  be  data  dependent.   These  two  types  of  stability  are  the 
same  as  the  "backward  stability"  and  "asymptotic  backward  stability"  used 
by  Miller  [1972]. 

The  backward  error  n  .    .  is  not  easy  to  determine,  and  for  this 
reason  we  introduce  two  variants  of  the  backward  error  which  are  easier  to 
compute.   Let  H^9>.  be  the  smallest  real  number  such  that 

(A  +  6A)x  =  b  +  6b 
for  some  6A  and  6b  with  |6A|  <_   n,?v|A|  and  |6b|  _<  n,„v  |b|.   Let  n,-,^  be  the 
smallest  real  number  such  that 

(A  +  6A)x  =  b 
for  some  6A  with  |  6A  |  Ji  n  ,-,  >J  A  |  .   Naturally  r\(r>\   .1  M/n\  i.n^x- 

THEOREM  3.1.  Let   I  =  {i:   (|a||x|  +  Ibl).  =  0}.  The  backward  error 

r  |  (b  -  Ax) .  | 

JiTl   (lAl|xl  +  lb]).    V   (b  "  A^i  -   0  /«  i  €  I. 


n(2)  =  \ 


+  oo 


otherwise . 


PROOF.   First  consider  the  case  where  (b  -  Ax).  ^  0  for  some  i£  I, 
Suppose  that  n,^  <  +  °°.   Then  there  exist  6A  and  6b  such  that  (A  +  6A)  x 


=  b  +  6b  where  1 6A  |  <_  T\,~Ak\    and  1 6b  |  _<  n,^  |b 


We  have 
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|b  -  Ax|   =    |fiAx  -  fib | 

£    |6A| |x|    +    | 6b | 

£  n(2)   (|a| |x|  +  |b|),  (3.1) 

which  is  impossible  since  (b  -  Ax).  ^  0  and  (|a||x|  +  |b|).  =  0.   Therefore 

H/9\  =  +  °°.   Second  consider  the  case  where  (b  -  Ax).  =  0  for  all  i  6  I, 

To  obtain  an  upper  bound  on  n^y  consider  the  choice 

(b  -  Ax). 
6a   _   fSgn(V|aiJ!  (|A||x|  +  |b|).  1*1- 

lj     L0  16  1, 


1*    I. 


and 

_ib  (b   -  Ax). 

6b.   -     f        l       (IA'I-I   +Ib|)± 

l  1    0  16  1. 

We  have   6 Ax  -  6b   =  b   -  Ax  or    (A  +  6A)x  =  b  +  6b,    and   so 


n / n  x  <   max 


(b  -  Ax)  J 


(2)  -  ."I  j   (|A|[x|  +  |b|). 
Since  n,.^  <  +  °°,  equation  (3.1)  must  hold;  and  so 


I  (b   -  Ax) .  | 

n.    .    >      max         /|  a  I  u  I     ■     IuIn         •  Q.E.D 

(2)  ~  ±  £  j      (|A| |x|    +   |b|)i 


THEOREM  3.2.     Let  I   =   {i    :(|a||x|).    =  0}.      The  backward  error 

—     if     (b   -  Ax)  .    =  0     for     i  6=  I , 


i 
|(b   -  Ax). 


max 

n 


iTi TW^ 


■+  oo  otherwise. 

PROOF.   Similar  to  Theorem  3.1.   Q.E.D. 

Remark.   Similar  types  of  results  apply  to  other  problems;  for 

example,  the  backward  error  for  a  polynomial  equation  a»  x  +  a  x    +. . .+ 

a  =  0  is  given  by 
n         b  j 

l    ~n      ^n-1  ,        | 
a_.  x  +  a.  x    +.  . .+  a 

n(2)-_L<L ^      ni_  . 

I         ~n  I         I        ^n_i  I    L       _L    I       I 

a_  x   +  a.  x     +. . .+  a 
1  0    '    '1     '        '  n ' 
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The  following  theorem  gives  bounds  on  the  relative  backward  error 
H/o\  in  terms  of  the  more  easily  computed  r\s-,\    and  n^-v-   These  bounds  show 
that  the  n's  are  roughly  the  same  size  when  the  backward  error  is  small, 
and  so  in  the  remainder  of  the  paper  only  the  quantity  n,,N  is  used,  which 
we  denote  simply  by  r\. 

THEOREM  3.3.  The  three  types  of  baahdard  error  satisfy 

n(2) 

<n(3)  <n(2)  .  (3.2) 


<  n(3)  <ti(1)  ,  (3.3) 


2  +  n(1)  ln(2)  1^(1)  '  (3'4) 

PROOF.   The  second  inequalities  of  (3.2),  (3.3),  and  (3.4)  are 
obvious.   The  first  inequality  of  (3.2)  is  obvious  if  ri,  ^  •>  1.   Hence 
assume  H/«\  <    1.   There  exist  5A,  6b,  fix  such  that 

(A  +  6k)  (x   -  <Sx)  =  b  +  6b 


2  +  n(2) 

n(D 

3  +   3^(1) 

n(D 

where    |6A|   £  n^J  a|  ,    |  6b  |    <_  n,.^  |b| ,    |6x|    <  n,3Jx  -   6: 

lb    -  Ax  I    =    |fiA(x   -   fix)    -  A6x   -   fib  I 


Hence 


It  is  easily  shown 


£  2n/ov|A||x  -  5x|  +  M/3Jb|.  (3.5) 


x  -  fix  < x  ,  (3.6) 

-l  -  n(3) 


and  so 

2n 


-  Ax I        (3)   (|a|  |x|  +  Ibl) 

-l  -  n(3) 


Therefore   n,2\  £  2n,2s/(l  -  ^(r>\)   which  verifies  (3.2).   The  first  inequality 

3  3 

of  (3.3)  is  obvious  if  H/^n  >.  "F.   Hence  assume  H/o\  K   T*   From  (3.5) 
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b  -  Ax  J  <_  2n,3JA|  |x  -  6x|  +  n,3x  |b  -  Ax  |  +  n  ^v  |A|  |x|  , 


and  using  (3.6)  gives 

2n 


Therefore 


(l  ■  rw)|b  ■-  A*|   (T  -^-  +  n,„.)|A||x|. 
(3)        '  —  1  -  ri/o-j     (3) 


n(3)  O  -  n(3)) 
(1  -  n(3)) 


wh 


<    3n(3)  (1  ~  3^C3>> 
(1  -  3n(3))(l  -  3n(3)) 

ich  proves  (3.3).   The  first  unequality  of  (3.4)  is  obvious  if  n,~N  >,  1. 
Hence  assume  H/9\  <  1.   We  have 

|b  -  Ax  |  <_  n(2)(|b|  +  |A|  |x|)  1  n(2)(|b  -  Ax  |  +  2 1 A  |  |x|), 
and  so 

lb  -  a* I  <  ,  n(2)  IaIIxI. 
-l-n(2) 

Therefore  1/-,>.  £  2n/-N./(l  -  H/^),  which  implies  (3.4).   Q.E.D. 

A  good  algorithm  should  (i)  return  an  acceptable  answer  most 
of  the  time  (robustness)  ,  and  (ii)  signal  failure  whenever  it  does  not 
return  an  acceptable  answer  (reliability) .   We  formally  define  an  algorithm 
to  be  reliable  if  there  exist  K(n)  and  u(n)  such  that  for  any  (A,  b)  and 
any  u  _<  u(n)  either  the  algorithm  computes  an  answer  with  n  <_   K(n)u  or  the 

algorithm  signals  failure. 

Any  algorithm  for  solving  linear  systems  can  be  made  reliable  by 
computing  the  backward  error  with  floating  point  arithmetic  and  then  accepting 
the  answer  only  if  the  computed  backward  error  is  less  than  a  prescribed  multiple 
of  the  unit  roundoff  error.   For  example,  the  next  theorem  shows  that  if  the 
computed  backward  error  f)  <  Ku,  then  we  can  conclude  that  n  <  (K  +  n)ue 
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The  residual  is  to  be  computed  in  single  precision 

r.  =  b.  -  (...(a...  x  x,  +  a._  x  x_)  .  .  .  +  a.   x  x  ) 
11         ll    1    i2    2        in    n 

or  in  double  precision 

r.  =  fl(b.  -  (...(a.,  x  x.  +  a._  x  x0)...  4-  a.   *  x  )). 
l       i         11    i    12    2        in    n 

Here  6  denotes  the  double  precision  counterpart  of  O  where  it  is  assumed  that 

x  6  y  =  (x  o  y)  (1  +  6) 

i  i     2       2 
with  |6|<_u/(l  +  u).   In  practice  the  double  precision  unit  roundoff  error 

is  either  this  small  (rounding  in  base  two)  or  smaller.   By  fl(°)  we  mean 

the  conversion  of  a  double  precision  value  to  a  single  precision  value.   It 

is  assumed  that  fl(xoy)  =  (xoy)(l+<5)  with  -u/(l+u)  <   6  £  u,  which  is  true  for 

rounding  and  chopping.   The  computed  backward  error  f)  is  determined  by 

n  =  max  (|r1|7(...(|a±1  *  k±\    +  \a±2   *  ^|)...  +  |a±n  *  XJ). 
i 

THEOREM  3.4.  If   n  is   the  computed  value  of  r\}    then 

-(n+2)u    "  _   nu  (n+2)u   -  nu 

e  n-nue      ^.n£e  n+nue 

where 

r\x       for  single  precision  residual  accumulation, 

U\      2 

V-u  for  double  precision  residual  accumulation. 

PROOF.   Let  q  be  the  computed  value  of  Ax.   By  the  usual  type  of 
error  analysis 

|q  -  Ax |  <    [(1  +  G)n  -  1] |a| |x| 

<  nu  enU | A | |x|  .  (3.7) 

We  have  that 


(i  -  y^-)2 

~    .    1  +  u 

r\   > max 


b  - 


jm 
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and 


f\   < 


(1  +  u)(l  + 


1  +  u 


) 


max 


(1  " 


U   _nT1 


b  -  q| 

Arm 


1  +  u' 

where  division  of  two  vectors  is  defined  componentwise 


This  reduces  to 


f\   <   max 


(1  +  2u)(l  +  u)n 
from  which  it  follows  that 

-(n+2)u 


Anrr- 


(1  +  2u) 


(1  +  u) 


n-2 


n  <  max 


b  - 


A^ 


F* 


(n+2)u 


Using 


and  (3.7)  gives 

_   (n-2)u 


b  -  q|  -  |q  -  Ax |  1  |b  -  Ax |  .  .  |b  ■■  q|  +  |q  "  Ax|  . 

_   (n-2)u 


,b  ■■  q|  ■  nC  e^     -|A|  |x|  1  |b  -  Ax|  <  |b  -  q|  +  nu  e 
Dividing  by  |a||x|  and  taking  the  maximum  yields 

b  -  ql  L  -   (n-2)u 
-f-1-  +  nu  e 


A  x 


max 


b  •■  a  I    -   (n-2)u 

nu  e       <  ri  <  max 


aTH 


#- 


A  xl 


Q.E.D. 


Before  concluding  this  section,  it  should  be  mentioned  that  for  some  classes 
of  problems  it  may  be  unreasonable  to  expect  an  algorithm  to  be  stable.  If  the  numbert 
output  values  is  fairly  large  compared  to  the  number  of  input  values,  then 
it  becomes  very  difficult  for  an  algorithm  to  be  stable  because  each  output 
value  must  arise  from  the  scone   perturbation  of  the  input  values.   For 
example,  Miller  [1975]  shows  that  the  usual  algorithm  for  inverting  triangular 
matrices  is  unstable.   Hence  it  seems  better  to  use  stability  as  a  relative 
concept  rather  than  an  absolute  concept.   This  idea  is  used  by  Miller  [1976] 
in  a  paper  entitled  "Roundoff  analysis  by  direct  comparison  of  two 
algorithms." 
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4.  Gaussian  Elimination  with  Row  Pivoting  and  Scaling 
This  section  applies  the  ideas  of  the  preceding  section  to  Gaussian 
elimination  with  partial  pivoting  using  row  interchanges  and  implicit  row 
scaling.   The  reciprocals  of  the  scale  factors  are  to  be  given  as  inputs 

d..  ,  d_,...,  d   to  the  algorithm,  and  so  the  pivoting  is  done  as  if  one 
1    z       n 

were  solving  D  Ax  =  D  b  where  D  =  diag(dl5  d„,...,  d  ).   To  keep  the 

L        Z  n 

notation  simple,  it  is  assumed  that  the  equations  are  numbered  according 

to  their  ordering  after  all  row  interchanges  have  been  performed.   The 

computations  of  the  algorithm  are  as  follows: 

(1) 
a  .  .   =  a  .  .  , 

1.1      ij 


for  k  -  l(l)n  -  1, 


afk+1)=a<k)  -m„*a<*>,   l.j  »k+l.    (4.2) 
ij       ij     ik    kj      'J  - 


'0   if  i  <  j, 
y  m.  .      if  i  >  j, 


IJ 


ij 


<   aij   lf  i  -  j' 
U   if  i  >  j, 


(4.3) 


yi  =  V 

for  i  =  2(l)n,  y.  =  b.  -  (...(l±1   x  ¥±  +  l^   x  y2)  .  .  .  +  l±  ±mml   x  y._1)  , 

x   =  y  /u   , 
n    n  nn 

for  i  =  n  -  1(-1)1,  (4.4) 

x.  =  (y.  -  (...(u.  ,  -  x  x.,1  +  u.    .  x  x.,0)...  +  u.   ><x  ))/u... 
i     l         i,i+l    l+l    i,i+2    i+2        in    n    n 

It  is  assumed  that  the  selection  of  a  pivot  is  done  exactly  so  that 

In  cases  where  there  are  more  than  one  suitable  pivot,  the  one  with  the  lowest 
row  index  is  chosen.   The  assumption  of  exact  pivot  choice  avoids  some  minor 
technical  difficulties,  and  it  also  makes  for  a  sharp  error 
bound  in  the  case  where  there  is  no  scaling. 
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It  is  important  to  appreciate  the  nature  of  the  functional 
relationship  between  x  and  D.   The  computed  solution  x  is  a  function 
£(P)  of  the  row  permutation  P,  which  in  turn  is  a  function  11(D)  of  the 
scaling  matrix  D.   (Note  that  H(D)  is  also  defined  for  values  of  D  which 
are  not  floating  point  numbers  because  the  algorithm  does  not  perform 
floating  point  arithmetic  on  D.)   If  x  is  viewed  as  a  function  defined 


in  (d..  ,  d_,...,  d  )-space,  it  would  be  constant  over  regions  bounded  by 

(k) 
hyperplanes  passing  through  the  origin.   For  example,  let  a..   be  the  values 

corresponding  to  a  certain  choice  d.  =  d.  of  scale  factors.   Then  x  is 

1    1 

constant  for  all  values  of  (d^  ,  d  ,...,  d  )  which  satisfy 


d.   > 

x  ■ 


ff 


L(k) 

kk 


i  >  k. 


It  is  necessary  for  the  completeness  of  the  theory  that  the  Gaussian 
elimination  algorithm  be  extended  to  permit  the  use  of  zero  as  a  scaling 


factor.   For  any  diagonal  matrix  D  =  diag(d  ,  d„,...,  d  )  we  let  D  denote 
the  matrix  diag({d  }  ,  {d„}  ,...,  {d  }  )  where 


{d}   = 


d   if  d  4   0 


e   if  d  =  0, 


That  is,  D   is  the  matrix  D  with  all  the  zero  diagonal  elements  replaced  by 
e.   The  condition  (4.5)  is  replaced  by  the  following: 


lim 
e  -v  0 


u.r1 .« 

i  e   lk 


{dk}e   akk 


<  1 


,  r].         (k)  ,  (k) 
Let  e  =  minljd.  a,,  'a-i. 


d   =  0,  d.  4   0}.   Then  it  can  be  easily  shown 

K.  X 


that  for  any  c   with  0  <  lei  <  e  the  scale  matrix  D  has  the  same  effect  on 
the  choice  of  pivots  as  does  D. 

Unfortunately,  Gaussian  elimination  with  pivoting  is  unstable. 
This  instability  arises  in  the  decomposition  stage  when  the  quantities 
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(k+1)    (k)  .     *   (k) 
a .  .     =  a .  .   -  m . ,  x  a,  . 
13       iJ     ik    kj 

(k) 


(k) 


are  being  formed.   If  |m.,  a,  .    >>  |a    |,  then  the  error  in  a..    is  not 


(k+1) 


ik  "kj 
(k) 


lj 


13 


very  small  relative  to  a..  ,  and  so  in  our  backward  error  analysis  we  cannot 


throw  this  error  back  into  a 


(k) 


The  extreme  case  occurs 


(k)  (k+1) 

when  a..   =0  and  a..     ^  0,  which  is  commonly  called  "fill  in."   For 

sparse  systems  of  equations  it  is  quite  common  to  order  the  rows  so  as  to 

avoid  fill  in.   This  reduces  computational  cost,  and  it  apparently  may 

also  contribute  to  stability. 

The  instability  of  Gaussian  elimination  has  been  pointed  out  bv 

Hamming,  who  on  page  119  of  his  book  [1971]  announces  the 

"Theorem  Pivoting  can  take  a  well-conditioned  system  into  an 
ill-conditioned  system  of  simultaneous  linear  equations." 

and  on  page  123  states 


"We  have  not  justified  the  pivoting  method;  rather  we  have 
shown  that  it  is  an  'old  wives'  tale.'   But  like  most  old 
wives'  tales,  it  is  a  mixture  of  truth  and  mystic  faith." 


To  prove  his  theorem,  Hamming  uses  the  example  discussed  at  the  end  of  §2 
For  this  example,  one  elimination  step  with  partial  or  complete  pivoting 
yields  the  system  A'x  =  b'  where 


A'  = 


-4       -2 
0  -j-   +  2e  ~  +  2e 


0   -2       -1 
T  +  2e     T  -  e 


,  b'  = 


3  +  3e 
-2  +  4e 
-1  +  e 


(4.6) 


assuming  exact  arithmetic.   This  problem  is  ill  conditioned  for  small  e3 
since 


Cond(A',x)  = 


Se  -1  -  3  +  3e 
1  -  1.8e 
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If  the  elimination  were  performed  in  floating  point  arithmetic,  then  a 
slight  perturbation  of  (4.6)  could  result,  which  may  have  a  solution  which 
differs  from  the  true  solution  by  an  amount  proportional  to  c   .   This  kind 
of  error  could  not  arise  from  slightly  perturbing  the  original  problem 
because  it  has  a  condition  number  of  about  6.   For  example,  suppose  that 
the  computed  righthand  side  of  (4.6)  was 


3  +  3e 


b'  = 


-2  +  4e  - 


-1  +  e  + 


and  everything  else  was  exact.   Then 


1  -  e  -1  u/8 


1  +  e  X   u/4 


and  by  Theorem  3.2  the  backward  error  is  u/(8e  +  u) . 

A  related  observation  was  made  by  Gear  [1975] 


"It  might  be  possible  to  say  that  6A  represents  a  perturbation  to 
the  original  physical  problem  if  the  sparsity  structure  of  6k 
were  the  same  as  that  of  A.   Unfortunately,  we  will  show  that  such 
a  demand  on  the  structure  of  6A  can  lead  to  very  large  bounds  on 
I 6A J  |,  bounds  probably  dependent  on  the  condition  number  of  A." 


This  was  supported  by  the  example 


11-1-1 
0  e   0   0 
0  0   e   0 
JL     0   0   1 
for  which  Cond(A)  =  4. 


b  = 


,   x  - 


1 

.-1 

-1 
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In  one  of  the  theorems  that  follow,  it  is  shown  that  the  situation 
is  not  quite  as  bad  as  Gear  suggests.   The  bounds  on   |6A| |  depend  not  on 
how  ill  conditioned  the  problem  is  but  on  how  badly  scaled  the  equations  are. 
We  begin  by  obtaining  a  totally  a  priori   error  bound  for  Gaussian  elimination. 
The  proof  is  modeled  after  that  of  Forsythe  and  Moler  [1967],  which  is  mostly 
borrowed  from  Wilkinson  [1963].   However,  our  error  bound,  like  that  of 
Van  der  Sluis  [1970b],  is  more  informative  than  that  of  Forsythe  and  Moler 
in  that  it  distinguishes  among  the  columns  of  AA. 

THEOREM  4.1.  Let  the  vector   x  be  computed  by  Gaussian  elimination 

with  row  pivoting  and  row  scaling  where   D  =  diag(d1  ,  d_,...,  d  )  is   the 

12       n 

matrix  of  reciprocal  scale  factors.      Then  there  exists   AA  such  that 

(A  +  AA)x  =  b 
with 

||  Id"1  aa|z||  <  X(n)u||  Id"1  a|z|| 
for  arbitrary   z  >_  0  and  arbitrary   z  satisfying   0  <  |e|  <  e"  where 
X(n)  =  [19- 2n-2  -  n  -  81e2nu. 

PROOF.   See  Appendix  A.   Q.E.D. 

Note  1 .   The  factor  e    appears  in  the  Forsythe  and  Moler  book 
as  the  constant  1.01.   The  advantage  of  e    is  that  it  indicates  the 
nature  of  the  higher  order  effects  and  it  does  not  require  placing 
some  arbitrary  restriction  on  the  size  of  nu. 

Note  2.   It  is  actually  possible  to  show  that  |  A  A  [  <_  L  [  A  |  for 
some  lower  triangular  matrix  L  ,   although   the  best  possible  LR  is 
somewhat  complicated. 

With  z  =  e  in  Theorem  4.1  we  get  the  usual  type  of  bound 
Hd;1  AA||  <  X(n)u||D-1  A||  . 
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The  bound  of  this  theorem  is  practically  always  extremely  pessimistic. 
However,  there  are  cases  where  this  bound  can  be  attained  in  the  limit 
as  u  -*■  0. 

THEOREM  4.2.  There  exists  a  problem   Ax  =  b  and  a  floating  point 
arithmetic   <+,  -,  x,  />  suoh  that  the  solution   x  computed  by  Gaussian 
elimination  with  partial  or  complete  pivoting  satisfies   (A  +  AA)x  =  b 
only  for  those  matrices   AA  for  which 
AA 


>  ri9-2n  2  -  n  -  81u  +  0(u2). 

Therefore  s    the  bound  of  Theorem   4.1  is  the  best  possible  bound  in  the 
limit   u  ■+  0. 

PROOF.   See  Appendix  A  for  the  proof,  which  employs  a  modification 
of  Wilkinson's  [1963]  example.   Q.E.D. 

The  next  theorem  uses  Theorem  4.1  to  get  a  bound  on  the  backward 
error.  We  are  especially  interested  in  the  effect  of  scaling  on  the  error 
bound. 

THEOREM  4.3.  Let   I  =  {i:   ( | A| | x| ) .  =  0} .  If  d .  =  0  for   i  G  I 
and   d .  ^  0  for   i  4  1  >  then  the  backward  error 

,   s   max  |D  a| |x| 


min  (Id  A I  I x I ) . 


i#t  '     * 

PROOF.   Putting  z  =  |x|  in  Theorem  4.1  gives 
|D^(b  -  Ax)||=||D^  AAx||  <  ||  ID"1  AA||x||| 


(4.7) 


<  x(n)u  |||d£1  A|  |x|||. 

Multiplying  this  by  e  and  letting  e  -*■   0  gives 

max   |  (b  -  Ax)  .  |  _5_  X  (n)u  max   (  |  A|  |x  |  )  .  , 
i  e  I  x  i  G  I         x 

from  which  it  follows  that  (b  -  Ax).  =  0  for  i£  T.   Hence,  from  Theorem  3.2 
we  have  that 


27 


-1 


(b   -  Ax). |  (|D      (b   -  Ax)]). 


1 
max      ,  i  .  I  i  -  i ; =     max 


I  ^    I    C|A|  |Sfc|>±  .    £T    (|D"lA||ft|)i 

and    from    (A.  7)    we    see   that 

(|D_1(b   -  Ax) |) .   ^  x(n)u     max         |d-1a||x|.  Q.E.D. 

By   choosing 

d.    =    (|a| |x|).  (4.8) 

i  '  .  i 

the  bound  on  the  backward  error  is  minimized,  giving  n  <^  x(n)u«   This 
suggests  that  a  linear  system  should  be  scaled  by  dividing  each  row  by 
its  weighted  £..  norm  where  the  weights  are  the  components  of  the  computed 
solution.   Unfortunately,  (4.8)  represents  an  implicit  equation  for  the 
scale  factors  d.  because  the  computed  solution  x  is  a  function  £  (11(D))  of 
the  scaling  matrix  D;  that  is,  D  must  solve  the  equation 

D  =  diag(|A| |?(n(D))|),  (4.9) 

for  which  a  solution  may  not  exist.   The  nature  of  this  equation  becomes 
more  apparent  by  noting  that  it  is  equivalent  to  solving  for  a 
permutation  P  that  satisfies 

P  =  n(diag(|A||c(P)|)).  (4.10) 

For  if  D  satisfies  (4.9),  then  P  =  11(D)  satisfies  (4.10);  and  if  P 
satisfies  (4.10),  then  D  =  diag( | a| | £ (P) | )  satisfies  (4.9).   In  principle 
we  could  determine  the  solution  to  (4.9),  if  there  is  one,  by  testing  to  see 
if  any  of  the  n!  permutations  P  satisfy  (4.10).   In  the  cases  where  a  solution 
exists,  the  backward  error  is  bounded  by  x(n)u.   One  suspects  that  (4.10) 
almost  always  has  a  solution,  and  it  is  even  conceivable  that  (4.10)  always 
has  a  solution,  at  least  whenever  £(P)  is  defined  for  all  P.   The  existence 
of  a  solution  for  (4.10)  implies  the  existence  of  an  ordering  for  the  rows 
which  makes  Gaussian  elimination  stable. 
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If  one  wishes  to  solve  (4.10),  the  following  iteration  would 
likely  converge  and  converge  quickly  for  almost  every  system  of  equations: 
P(Q)  =  II(diag(|A|e)), 

P(m+1)  =  n(diag(|A||5(P(m))|)),  m  =  1,  2,.... 
This  is  not  suggested  as  a  practical  algorithm  though  because  poorly 
scaled  equations  can  usually  be  accurately  solved  by  doing  iterative  improvement. 

A  more  useful  application  of  Theorem  4.3  is  the  diagnosis  of 
ill-scaling,  for 


aR(A,  x)  = 


max 

A 

X 

min 

A 

X 

is  an  easily  computable  measure  of  how  badly  scaled  the  rows  are. 

Remark.   The  bound  of  Theorem  4.1  can  be  refined: 

(|D  AA|z).  <_  x(n)u  max   (|d  A|z).. 
1  ~       J  1  i  J 


This  suggests  that 
max   ( | A | | x | ) 


1 


.ii2.   71X17x1). 
i  >  j  i  1 1  i  'i 


would  be  a  better  measure  of  the  possible  effect  of  ill  scaling. 

The  quantity  aD(A,  x)  is  not  very  satisfactory  for  theoretical 
purposes  because  x  depends  on  the  arithmetic  used  in  the  computation.   We 

would  prefer  to  use  a  (A,  x)  for  the  theory.   For  Hamming's  example 

R 

|a|  |x|  =  (3  +  3e,  6e,  4e)   and  a  (A,x)  =  -f-  +  j-   .   Near  optimal  row 


R 
scaling  for  this  problem  is  given  by 


4e 


D*A- 


3   2   1 
2 


2   2 
2  -1 


,  D_1b  = 


3  +  3e 
6 
2 


For  Gear's  example  |a||x|  =  (2/e  +  2,  1,  1,  2)  and  aR(A,x) 
optimal  row  scaling  is  given  by 


=  2/e  +  2.   Near 
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£ 

E 

£ 

E 

0 

D"1A  = 

0 
0 

e 
0 

0 

£ 

0 
0 

,      D   1b   = 

1 

1 

1 

0 

0 

1 

2 

One  may  wonder  about  the  effect  of  scaling  strategies  such  as 

row  equilibration.   Van  der  Sluis  [1970b,  p.  80]  gives  an  example  showing 

"that  it  is  quite  possible. .. that  there  exists  no  bound  depending  on   n  only 

for  the  ratios  of  the  errors  after  and  before  equilibration."  He  goes  on 

to  describe  a  cautious  equilibration  scheme  that  never  worsens  the  situation 

at  the  expense  of  possibly  not  improving  it.   An  adaptation  of  this  scheme 

to  our  theory  is  to  choose 

a.  . 
ill 


d.  =  min  max  — 
i    ,    .   'a.  . '  ' 

which  has  the  effect  of  leaving  no  row  of  A  strictly  dominated  by  any  other 
row  of  A.   Note  that,  since  d.  <   1,  we  have  min|D  a| |x|  >  a| |x|.   Furthermore, 

(|a|  |x|).  =  min  Sl-^l 


i£i 
k  I   ak£'   k£    l 


<   min  max 


JJ. 


a,  „   x 


k   j    kj   £ 
<_  d  .  max  |  A  |  |  x  |  , 
whence  max | D  A||x|  <_max|A||x|.   Therefore 

min ID  A| I x 


fr^-1,*,        \        max[p  A 1  j  x  | 
aR(D  A,  x)  -  -1     |  - 


max 

A 

X 

min 

A 

X 

=  aR(A,  x), 


so  that  the  scaling  of  the  problem  is  not  made  worse  by  our  choice  of  D, 
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The  theorem  that  follows  gives  bounds  on  the  backward  error  that 
in  the  limit  u  ->  0  depend  only  on  how  ill  scaled  the  problem  is  and  not 

how  ill  conditioned  it  is.   First  we  need  a  lemma. 

LEMMA  4.4.  If   D  is  nonsingular,    then 


D"lAAx||  <X(n)u|||D-1A||x||l 

1  -  x(n)u  Cond(A  D) 


PROOF.   We  have 


x  +  A  1AAx  =  x, 


which  implies 


and 


D  -""AAx  =  (I  +  D  1AA   "4))  1D  XAAx 


iD^AAxlU—  llD_] 


1  -  || D  1AA  A  *  D|| 


The  term  in  the  numerator 

||d_1aax||  ±  ||  |d_1aa|  |x|  II 

<_  x(n)u  I    I D     A  I  |x  |  ||, 

and   the   term  in  the  denominator 

||D_1AA  A_1D||  =  HId^AA  A_1D|e|| 

±  ||  |d-1AA|  I  A_1D  |  e  |i 

<_   x(n)u  ||  |d-1a|  |A_1D|e|| 

=  x(n)u  ||  |d_1a|  |A_1d|  II  .  Q.E.D. 

THEOREM  4.5.  Let   D  be  nonsingular  and  let    |a| |x|  >  0. 

Gaussian  elimination  with  row  pivoting  gives 

X(n)u  a    (D~  A,  x) 
n  <  * 


1  -  2x(n)u  Cond(A  1D)  aD (D  1A,  x) 

R 


provided  that  the  denominator  is  positive. 

PROOF.   Choose  AA  as  in  Theorem  4.1.   From  Theorem  3.2  we  have  that 
-  AAx  | 


T)  =   max 


XfPT 
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Since  x  =  x  +  A   Ax,  we  have 

|x|  <_   |x|  +  |A~1D|e||D~1AAx||  , 
and  so 

Id_1aIIxI  1  Id_1aI|x|  +  |d_1a|  |A_1D|e)|D~1AAx|| 

Thus 

e||D-1AAx|| 
n   <   max 


|  D  1A  |  |  x  | 

e||D_1AAx 
<  max 


D  1A||x|  -  e  Cond(A  1D)||d  AAx 


Applying  Lemma  4.4  yields 

x(n)u  a    (D     A,   x)  _ 

e||D      AAx  ||  < — |D     A  I  |x|     , 

1   -   x(n)u   Cond(A     D) 

from  which  the  theorem  follows.   Q.E.D. 

Although  we  are  unable  to  prove  that  there  is  always  some 
ordering  of  the  rows  for  which  Gaussian  elimination  is  stable,  we  can  show 
that  this  is  true  asymptotically  as  u  ■*■   0. 

THEOREM  4.6.  For  any  problem  such   that    |a||x|  >  0  there  is  some 
ordering  of  the  rows  for  which  Gaussian  elimination  is  asymptotically  stable. 

PROOF.   Using  Theorem  4.5  with  D  =  diag(|A||x|)  gives 

n  < *L(iOu 


A 
1  -  2x(n)u  max  J — ' 


1||A||x| 


A|  |x 
for  small  enough  u.   Hence  for 

IaI Ixl 


u  <  u(n)  =  min 


4x(n) |a| |a  I |a| |x 


we  have 

n  <  2x(n)u.  Q.E.D. 
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We  end  this  section  by  examining  the  effect  of  scaling  on  a  good 

bound  for  the  "forward"  error. 

THEOREM  4.7.  The  error 


_  x||  <  x(n)ullA-1DH  HlD^Allxl 


1  ■  x(n)u  Cond(A_1D) 
PROOF.   We  have 

||x  -  x  ||  =  ||a_1(-  AAx)  II 

<  ||a-1d||  ||d-1aax||  , 

and  the  theorem  follows  from  Lemma  4.4.   Q.E.D. 

If  higher  order  terms  in  u  are  ignored,  the  bound  on  the 
error  is  minimized  by  choosing  D  =  diag(  |A  |  |x  |) .   Thus 

Ha"1!!    II  IaI  UN 

lllA^llAllxlH 
is  a  measure  of  the  possible  effect  on  the  "forward"  error  of  how  poorly  the 
equations  happen  to  be  scaled.   For  Hamming's  example  this  quantity  is 

fi       —  1  —l 

yy  e   +  0(1)  and  for  Gear's  it  is  e   +  0(1). 

The  usual  type  of  bound  on  the  error  is  of  the  form 

||ft  -   x||  <  x(n)u|K"1li||K'1A||||x||+  0(u2). 

This  is  minimized  by  D  =  diag(|A|e),  which  is  row  equilibration  with 

the  £  norm. 
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5.  Gaussian  Elimination  with  Column  Pivoting  and  Scaling 

This  section  is  similar  to  the  previous  section  except  that  we 

examine  the  variant  of  Gaussian  elimination  in  which  the  columns  are 

interchanged  in  order  to  ensure  that  the  pivot  element  is  the  largest  in 

its  row.   The  algorithm  is  assumed  to  do  column  scaling  where  the  scale 

factors  d..  ,  d„,...,  d  are  given  as  inputs  to  the  algorithm.   Again  the 
1    /       n 

selection  of  the  pivot  is  assumed  to  be  done    exactly  so  that 


aki   di 


(k) 


akk 


<   d.  ,    i  >  k  +  1  . 

—  '  k1      — 


Writing  the  condition  in  this  form  allows  for  the  use  of  zero  scale  factors 

but  does  not  permit  the  selection  of  a  zero  pivot. 

An  a  priori   error  bound  is  given  by  the  following  theorem: 
THEOREM  5.1.  Let  the  vector   x  be  computed  by  Gaussian  elimination 

with  column  pivoting  and  column  scaling  where   D  =  diag(d  ,  d  ,...,  d  )  is 

the  matrix  of  scale  factors.      Then  there  exists   A A  such  that 
(A  +  AA)x  =  b 

with  |AAD)e  <_   x(n)u|AD|e 

where  x(n)  =127*2    -  5n  -  7je 

PROOF.   See  Appendix  B.   Q.E.D. 

It  is  likely  that  the  constant  x(n)  in  this  bound  could  be 

replaced  by  a  smaller  constant. 

The  following  theorem  indicates  how  the  columns  should  be  scaled 
in  order  that  Gaussian  elimination  with  column  pivoting  be  stable. 

THEOREM  5.2.  Let   x  be  the  value  of  A     b  computed  by  Gaussian 
elimination  with  column  pivoting  and  column  scaling  where   D  =  diag(d  ,  d~,...,  d  ) 
is   the  matrix  of  scale  factors.      Let    I  =  {i:  (|a| |x|).  =  0}  and  let 
J={j:x=0}.  If  d.   =  0  for   j  GJ  and   d.  f   0  for   j  $  J ,    then  the  backward  error 

•J  J  J 
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(|AD|e), 


n  <  x(n)u  max  ,i  .  i  i^i?-  max  (|d  x|)  .. 

PROOF.   We  have  x  =  DD~  x  where  D  denotes  D  with  all  diagonal 
zero  entries  replaced  by  ones.   Hence 

|b  -  Ax |  =  |AAx|  £  |AAD|e  Hd"1*!! 


£  X(n)u  |AD|e  |  JD^x 


-1. 


=  x(n)u  |AD|e  max  (|d  x|).. 
&  J 

We  have  that  (IaIIxI).  =  0  implies  that  a..  =  0  for  i  ^  J,  which  implies 
1111  l  ij  '  r 

that  (lADle).  =  0.   Hence  the  theorem  follows  from  Theorem  3.2.   Q.E.D. 
COROLLARY.  The  backward  error 
C |d_1x| ) . 

n  _<  x(n)u  max  - "•-  . 

i,j<fj  (|D_ix|). 
l 

PROOF.   We  have 


e  <_  |D  x|  max  — : .  Q.E.D. 

rfJ  (Id^xI). 


A  choice  of  D  which  minimizes  the  bound  on  the  backward  error  is 

d  .  =  x . ; 
l    l 

that  is,  we  scale  by  multiplying  the  i-th  column  by  the  i-th  component  of 

the  computed  solution.   Again  these  weights  are  not  known  at  the  time  when 

scaling  is  performed.   The  main  value  of  this  theorem  is  that  it  gives  an 

easily  computable  measure  of  column  ill  scaling: 

~  /a   ~\       [  A]  e  l|x|| 
ac(A,  x)  =  max  ^J^f     • 

For  theoretical  purposes  we  would  prefer  to  use  a  (A,  x) .   For  Hamming's 

1    2 
example  a  (A,  x)  =  -r—  +  — .   Near  optimal  row  and  column  scaling,  which 

would  be  appropriate  for  complete  pivoting,  is  given  by 
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DIlAD2  = 


3e  2  1 
2  2  2 
12-1 


For  Gear's  example  an(A,    x)  =  — . 
is  given  by 


&• 


3  +  3e 
6 
2 
Near  optimal  row  and  column  scaling 


DllAD2   = 


dA  = 


E       1       -1       E 

0      10     0 
0     0        10 
10        0      1 
LEMMA  5.3.      If  D  is  nonsingular 3    then 

lAAftl    <   *(n)u    |AD|e  \\vT\\\ 
|A       '    -  1  -   x(n)u  Cond(AD) 

provided  that  the  denominator'  is  positive. 

PROOF.      We   have 

x  +  A_1AAx  =   x, 


which   implies 


Hence, 


AAx  =   AAD(I  +  D     A   1AAD)    1   D   1x. 


IaAxI    <    lAADle  II  (I  +  D-1A   "''AAD)    1  D     x] 


-1 


X(n)u    |AD|e  ||D      x|| 

l  -  ||d"1a"1aad|| 


The   term  in   the  denominator 


|D   1A   1AAD||^|||D   """A   1 1  |  A  AD  |  e  1 1 

:   x(n)u  II  |D_1A~:L  I  |AD|e 


=   x(n)u  II  ID   "'"A   1|  I AD 


Q.E.D. 


36 


THEOREM  5.4.  Let   D  be  nonsingular  and  let   |a||x|  >  0.  Gaussian 

elimination  with  column  pivoting  gives 

X(n)u  a  (AD,  D_1x) 
n  < 


1  -  2x(n)u  Cond(AD)  a  (AD,  D  """x) 


provided  that  the  denominator  is  positive. 

PROOF.   Choose  AA  as  in  Theorem  5.1.   From  Theorem  3.2  we  have  that 

-  AAxI 
n  =  max   I  I  I „  I   . 
|A| |x| 

Furthermore 

x  =  x  +  A  AAx, 


and  so 


IaMxI  >  I A I  I  x  I  -  I A I  I A  1||  AAx  I 


Therefore 

I  AAx 


n  <  max 


I A I  I  x  I  -  I A II A  -1 1  I  AAx  I 


Applying  Lemma  5.3  yields 


a  0    D  Xx) 


AAv  I  < I  A  I  I  v  I 

AAX|  -  1  -  v(n)u  Cond(AD)  lAHxl» 


X1 

from  which  the  theorem  follows.   Q.E.D. 

THEOREM  5.5.  For  any  problem  such  that    |x|  >  0  there  is  some 
ordering  of  the  columns  for  which  Gaussian  elimination  is  asymptotically  stable. 
PROOF.   Using  Theorem  5.4  with  D  =  diag(|x|)  gives 
X(n)u  


n  < 


A 
1  -  2x(n)u  max  J — 


-1 


| A | |x| 


for  small  enough  u.   Hence  for 
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u  <  u(n)  =  min 


-1 


4x(n)|A  x||A||x| 
we  have 

n  <_  2x(n)u.  Q.E.D, 

Recall  that  the  stability  threshold  in  the  case  of  Gaussian 
elimination  with  optimal  row  ordering  was 

lA||x| 


u(n)  =  min 


-1 


4x(n)|A| |A  X| |A| |x| 

It  is  easy  to  show  that  this  is  larger  than  the  stability  threshold  for 
optimal  column  ordering.  This  is  a  slight  indication  that  row  oivotine 
may  be  superior  to  column  pivoting. 


we   have 


THEOREM  5.6.      The  error 

||£  _   x||  <    xCnHllA-'llADllHlD^xll 
II*  l!  -        1   -   x(n)u   Cond(AD) 

PROOF.      Since 

(A  +  AA)(x  -   x)    =   -   AAx, 


x-x  =   -(I+A   1AA)    1  A   ^"AAx 

■   -   A_1AA(I  +  A~1AA)"1x 

=  -  A-1AAD(I  +  D~1A~1AAD)~1 


D  1x. 


Therefore, 


x  -  x  < 


1  -  i | D  1A  1AAD|| 

1a"1!  [aadU  Itsliidl 
i-|!  !d-1a_1I  IaadU 


<   X(n)u  H  |A   11  [API  II  Hd"^ 
-  1   -   x(n)u  Cond(AD) 


Q.E.D. 
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If  higher  order  terms  in  u  are  ignored,  the  bound  on  the  error 
is  minimized  by  choosing  D  =  diag(|x|).   Thus  for  column  pivoting 

HlA-'llA »|| 

HlA^llAllxlll 

is  a  measure  of  the  possible  effect  of  ill  scaling  on  the  "forward"  error. 
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6.  Iterative  Improvement 

It  is  often  thought  that  iterative  improvement  is  not  worthwhile 

unless  either  (i)  the  uncertainty  in  the  values  of  A  is  less  than  the  unit 

roundoff  error  (e.g.,  if  the  elements  of  A  are  integers)  or  (ii)  we  wish  to 

diagnose  ill  conditioning.   This  thinking  is  based  on  the  fact  that  Gaussian 

elimination  with  pivoting  is  stable  from  the  absolute  error  point  of  view. 

But  according  to  the  relative  error  point  of  view,  Gaussian  elimination  may 

not  give  acceptable  accuracy,  and  so  it  is  of  interest  to  examine  the 

stability  behavior  of  iterative  improvement.   Results  of  a  careful  error 

analysis  are  given  for  iterative  improvement  both  with  and  without  double 

precision  accumulation  of  the  residuals. 

The  algorithm  being  considered  is  described  as  follows  where 
subscripts  denote  iterates  rather  than  components  of  vectors: 

x..  =  value  of  A  b  computed  by  row  pivoting, 

for  m  =  1,  2,  3, .  .  . 

r  =  computed  value  of  b  -  Ax  , 
m  m 

d  =  value  of  A  r  computed  by  row  pivoting, 
m  m  y 

x  , ..  =  x  +  d  . 
m+1    m    m 

The  algorithm  for  computing  r  appears  in  §3  and  the  algorithm  for  d   is  in  §4, 
The  theorem  which  follows  shows  that  just  one  iteration  of  iterative 
improvement  with  just  single  precision  accumulation  of  the  residuals  is 
enough  to  make  Gaussian  elimination  asymptotically  stable.   This  may  seem  to 
contradict  the  usual  advice  (Forsythe  and  Moler  [1967,  p.  49])  that  "It  is 
absolutely  essential  that  the  residual  r  be  computed  with  a  higher  precision 
than  that  of  the  rest  of  the  computation."  Actually  there  is  little  conflict 
because  we  have  shown  that  poorly  scaled  systems  may  be  solved  with  an 
effective  precision  of  much  less  than  single  precision. 
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THEOREM  6.1.  Assume  that   |a||x|  >  0.  Gaussian  elimination 

followed  by  one  iteration  of  iterative  improvement  results  in  a  backward 

error  which  satisfies 

nl  1  (n+l)u  +  {(X(n)2  +  2nx(n)  +  n2  +  2n)  Cond(A_1)  a  (A,  x) 
^  R 

+  X(n)  aR(A,  x)  +  |  n2  +  |  n}u2  +  0(u3) 

for  single  precision  residual  accumulation  and 

n£  <  u  +  {X(n)2  Cond(A_1)  oR(A,  x)  +  X (n)  aR(A,  x)  +  n}u2  +  0(u3) 
for  double  precision  residual  accumulation.      That  is3    the  algorithm  is 
asymptotically  stable  in  either  case. 

PROOF.   The  theorem  follows  from  Theorem  C.9  in  Appendix  C.   Q.E.D. 

Note.   The  asymptotic  nature  of  these  bounds  conceals  the  fact  that 
certain  assumptions  on  the  smallness  of  u  are  necessary  in  order  to  get  any 
bound  at  all.   The  actual  assumptions,  found  in  Appendix  C,  are  too  lengthy 
to  reproduce  here;  roughly  speaking  it  must  be  assumed  that  the  coefficient 
of  the  second  order  term  is  less  than  1/u. 

Recall  from  Theorem  4.3  that  for  no  iterations  we  have 

2 
n  5  x(n)  °R(A>  x)u  +  °(u  )» 

and  thus  one  iteration  does  make  a  big  difference  in  the  size  of  the  bound. 

However,  the  presence  of  the  product  Cond(A   )  a  (A,  x)  in  the  second  order 

term  indicates  that  this  may  not  be  the  case  for  problems  which  are 

sufficiently  ill  scaled  and  ill  conditioned. 

THEOREM  6.2.  Assume    |a||x|  >  0.  The  backward  error   n  for 

iterative  improvement  of  Gaussian  elimination  with  row  pivoting  satisfies 

Tim  n'  <  (n+l)u  +  (2n(x(n)  +  n+1)  Cond(A-1)  aD(A,  x) 
m  —  k 

+  X(n)  aR(A,  x)  +  \   n2  +  |  n  +  l}u2  +  0(u3) 

for  single  precision  residual  accumulation  and 


lim  n"  <  u  +  (x(n)  an(A,  x)  +  n+l}u  +  0(u  ) 
m  —  K 


for  double  precision  residual  accumulation. 
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PROOF.   The  theorem  follows  from  Theorem  C.7  in  Appendix  C.   Q.E.D, 
The  main  effect  of  doing  more  iterations  with  single  precision 
accumulations  is  a  moderate  reduction  in  the  magnitude  of  the  second  term. 

But  for  the  double  precision  case  there  is  a  striking  improvement  due  to 

2-1  2 

the  disappearance  of  the  x (n)   Cond(A   )  a  (A,  x)u   term  so  that  the  bound 

K 

-1  3 

on  n"  depends  on  the  condition  number  of  A  *  only  through  the  0(u  )  term. 

m 

This  may  represent  a  significant  improvement  for  problems  which  are  both 
poorly  scaled  and  ill  conditioned. 
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7.  Practical  Implications 

The  comments  that  follow  are  suggested  by  the  error  analysis, 

but  their  usefulness  remains  to  be  established. 

By  means  of  examples  it  has  been  shown  that  Gaussian  elimination 

with  (partial  or  complete)  pivoting  does  not  generally  provide  all  the 

accuracy  that  the  data  deserve  or  even  a  fixed  fraction  of  that  accuracy. 

Hamming  [1971,  p.  121]  states 

"It  is  reasonable  to  ask  how  typical  these  examples  are  and 
how  often  in  the  past  the  pivoting  method  has  created  the 
ill  conditioning  that  was  reported  to  occur  by  some  library 
routines.   The  answers  are  not  known  at  this  time;  all  that 
is  claimed  is  that  textbooks  and  library  descriptions  rarely, 
if  ever,  mention  this  possibility  (though  it  is  apparently 
known  in  the  folklore)." 

and  so  it  seems  that  there  have  been  practical  instances  where  the  pivoting 

method  has  performed  poorly.   Perhaps  Gaussian  elimination  without 

iterative  improvement  should  be  regarded  as  a  "quick  and  dirty"  way  to 

solve  linear  equations. 

The  computation  of  the  backward  error  is  one  reliable  test  for 

deciding  whether  or  not  the  solution  of  a  linear  system  is  "reasonably 

accurate."   The  test  can  be  made  quite  efficient  by  accumulating  r  and 

|a| |x|  +  |b|  at  the  same  time.   If  the  test  is  failed,  then  in  most  cases 

the  use  of  iterative  improvement  would  result  in  a  solution  which  passes 

the  test.   One  could,  of  course,  forgo  the  backward  error  computation  and 

just  do  iterative  improvement  until  "convergence."   But  such  a  procedure 

may  not  be  completely  reliable  since  it  has  not  been  rigorously  proven  that 

"convergence"  implies  a  reasonably  accurate  solution.   Stewart  [1973, 

p.  205]  mentions  "the  possibility  that,  with  a  violently  ill-conditioned 

matrix,  the  iteration  may  appear  to  converge  to  a  false  solution." 
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It  is  also  suggested  by  the  theory  that  if  double  precision 
accumulation  of  the  residuals  is  costly,  then  iterative  improvement  with 
single  precision  accumulation  might  still  be  beneficial. 

The  success  of  the  pivoting  method  depends  upon  a  reasonable 
scaling  of  the  equations,  which  is  at  best  guesswork  unless  one  has  some 
knowledge  about  the  sizes  of  the  solution  components.   If  c  ~  |x|,  then 
(i)  for  row  pivoting  one  should  scale  the  system  to  get 
(D~1A)x  =  (D^b)  where  D  =  diag(|A|c). 
(ii)  for  complete  pivoting  one  should  scale  the  system  to  get 

(D~  AD2)(D~1x)  =  (D^b)  where  D  -  diag(|A|c)  and  T>2  =   diag(c) 
It  may  be  worthwhile  to  allow  users  of  a  linear  equation  solver  to  provide 
an  estimate  of  the  solution,  particularly  if  the  solution  variable  X  is  being 
used  only  as  an  output  parameter.   For  simple  use  of  the  program  X  could 
be  set  to  all  ones. 
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Appendix  A.      Error  Bounds  for  Row  Pivoting 

For  Lemmas  A.l  through  A. 8  it  is  assumed  that  D  has  nonzero  diagonal 
elements.   This  assumption  does  not  apply  to  the  theorem  that  follows 
these  lemmas.   For  any  n  x  n  matrix  C  =  (c  .  .  )  let  c.  =  c  .  . /d  .  .   Also, 


iJ 


iJ 


let  u)  =  1  +  u. 

LEMMA  A.l.  We  have 


and 


m..   <  a)  d .  /d,  ,     i  >  k, 
ik'  —   '  i   k1 


ij   i 


-(k)i    k+1  k_1  ,„  ,k-l-£,-   |    k-1,- 
a.  .    <  a)     I      {2b))  a.  .   +  a)    a 


ij 


i,j  :•  k. 


I    i  (k)  ,  (k) 
miJ  ^K   /akk 


PROOF.   Equation  (4.1)  implies 

i  >  k, 

and  because  of  row  pivoting  (see  (4.5))  we  get  the  first  inequality  of 
the  lemma.   Equation  (4.2)  implies 


|a<k+1> 
1  ij 

and  therefore 

|a(k+1) 
1  iJ 


<_oo|a 


(k) 
ij 


+  (1  +  ZiOlm.,  a^l,    i,j  >  k  +  1 
1  ik  kj  '       J  — 


<  Jafk)|  +  u(l  +  2u)|a1(k)|, 
-    ij  kj 


i,j  >  k  +  1, 


The  second  inequality  of  the  lemma  follows  from  this  by  induction  on  k.   Q.E.D. 
LEMMA  A. 2.  The  matrices   L  =  {I..)   and   U  =  (u..)  satisfy 

LD  =  A  +  E(1)  +E(2)  +...+  E(n"^ 

(k)  (k) 

where   the  matrices   E    have  elements   e..  which  satisfy 


ij 


-u>  ^laf^l  +  (2  +  3u)  a)  2u|m..a.u      for   i,j  >  k  , 

ik  kj 


(k) 


I.«l<< 


-1  I 
to  u  a 


(k) 
ik 


L   0 

regardless  of  the  pivoting  strategy. 


for   i  >  j  =  k, 

otherwise. 
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PROOF.   Define  the  elements  of  E 


(k) 


by 


r  (k+1)     (k)       (k) 
a  .  .    -  a .  .   +  m.,  a,  . 
ij       ij     ik  kj 


.  *(k)    *(k) 
mikakk  "  aik 


^' 


for  i,j  >  k, 
for  i  >  j  =  k, 
otherwise. 


By  separately  considering  the  cases  i  <_  j  and  i  >  j  it  is  straightforward 

(k)      (k) 

to  show  that  the  elements  e . .   of  E    satisfy 


n-1 


(k) 


n 


E   e;?'  -   £   £.,  u  .  -  a.  .  , 
k=l   «     k=i   lk  ^     ij 

which  establishes  the  equality  of  the  lemma.   Let  k  <^  n  -  1  be  fixed. 
Write  (4.1)  as 


™Ik=  C-S^Hl  +  V- 


i  >  k .+  1, 


and  (4.2)  as 

a(k+1)   =    (a00  -  m.,a<k)(l  +  6..))(1  +  6'..),  i,j  >  k  +  1, 
ij        ij     ik  kj       ij        ij       - 

where  the  6's  are  relative  roundoff  errors.   Then 


^a^V   -  m..  a,(k)(6..  +6'..  +6..  6'..)   for  i,j  >  k, 
ij   ij    xk  kj    xj    xj     xj   xj 


(k) 


U;  ..  /   UK 

J4    -  \   a-,-U   6 


IJ 


ik  ik 


Lo 


for  i  >  j  =  k, 


otherwise, 


and  the  lemma  follows  from  the  bound  u/(l  +  u)  on  the  6's.   Q.E.D. 
LEMMA  A. 3.  The  matrices   L  and   U  satisfy 
LU  =  A  +  E 

with 


-1 


|D      E|z||  £   (3-2 


n-1 


3)o2n-1xi\\v~1*\z 


for  arbitrary   z  _>  0, 
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PROOF.   Let  E  -  E(1)  +  E(2)  +. . .+  E(n  "^  where  the  E(k)  are  given 

by  Lemma  A. 2.   Substituting  the  bound  on  m.,  of  Lemma  A.l  into  the  bound 

xk 

(k) 

on  the  elements  e . .   we  get 

ij 

,-(k),     -1  ,-(k),    ,        -1  ,-(k), 
£..    <  w  u  a..   +  (2  +  3u)oj  u  a,  .   , 
1  ij  '  -     '  ij  '  kj 

which,  in  fact,  is  valid  for  all  i,j.   This  implies 
E  |e<k)|z.  <  3d||n-1A(k>|«|| 

for  arbitrary  z  >^  0.   From  Lemma  A.l  it  immediately  follows  that 
||  |D-1A(k)  |  Z||  <  2k-1a»2k-11||D"1A|Z||;  and  so  we  h* 

HlD^E^IzIl!  3-2krla."2n-1«|||D"1A|z||, 

from  which  the  lemma  follows.   Q.E.D. 
LEMMA  A. 4.  We  have 


lave 


and 


£.  .   <  to  d./d   , 
ij       i  3 


-   I  „   2i-l     r.i-£-lir   , 
u  .  .   <  oo      E    2       a, .  , 


i  >  J, 


i  <   3- 


Jl=l 


(i) 


PROOF.   Follows  from  Lemma  A.l  because  £..  =  m. .  and  u..  =  a..  .   Q.E.D, 

ij    ij      ij    ij 

LEMMA  A. 5.  The  vector   x  computed  by    (4.4)  satisfies    (U  +  5U)x  =  y 
for  some  upper  triangular  matrix   6U  such  that 


where 


6u..|  <  g(i,i)oj    |u..|  and   lu..  +  6u..|  < 


r 


n-i+li 
to      u 


ij 


g(i,j)  =  < 


n  -  j  +  2,     j  >  i  +  2, 

n  -  j  +  1,    j  =  i  +  1, 

2,  j  =  i  £  n  -  1, 

xj-,  j  -  i  =  n, 

regardless  of  the  pivoting  strategy. 
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PROOF.   From  (4.4)  we  have  that 


xi   x 


(1+6 

.)(l+6^) 

r    v  • 

.  V,U 

and 

u 
nn 

1  + 

X 

n 

6' 

n 

yn 

Xv     +  u       xx    ) 

i,i+l    i+l    i,i+2    ±+2J 


.+  u.   x  x  ) 
xn    n 


=  y. 


i  <  n  -  1 


where  6.  and  6!  are  relative  roundoff  errors  due  to  subtraction  and  division, 
x      x 

respectively.   By  the  usual  type  of  analysis  we  can  obtain  the  bounds 

r 


6u 


ij 


1 


.   n-j+2    ,  I    | 
(u)  J    -  l)|u   |, 

n-j+1      |    | 
(u>   J    -  1)  |u.  .  |  , 


(u)   -  1)  U 


1J 


(O)  -  1)  u.  .  , 


V. 


1J 


j  1  i  +  2, 

j  =  i  +  1, 

j  =  i  <  n  -  1, 

j  =  i  =  n, 


from  which  the  lemma  follows.   Q.E.D. 

LEMMA  A. 6.  The  matrix   6U  of  Lemma   A. 5  satisfies 

|||d_1l6u|z||  <  f5.2n"2  -  21  m^Hd^aUII 

for  arbitrary   z  _>  0. 

PROOF.   From  Lemma  A. 4  it  follows  that 

(|d~1L6u|z).  <  Z   E  IdT1  A.,  d,  I  |6u,  .  |z. 
x  —  .  .    x    xk  k   '   kv  i 
J  k 

<  (jo  E  E   6u,  ,  z  .  . 
"   J  k    kj   J 

Applying  Lemma  A. 5  first  and  then  Lemma  A. 4  gives 
(|D_1L6u|z).  <  u  E  E  g(k,j)wn~k+1|u,.|z. 


j  k 


j1  j 
k 


<u  E   E   g(k,j)Wn+k  E   f2k"   ll|5f.|s, 
j=l  k=l  £=1  ^   J 


<»   u   E    E    E   r2k"£_1l  g(k,j)|i   |z 
1=1   j=£  k=£  J   J 
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After  much  manipulation   it   turns   out   that 

max        E      U1^"1!    g(k,j)    =    15- 2n"*"2], 
£.<j<n  k=£ 

and  therefore 

(|D_1L6u|z).  <  w2n  u  E   L  5-2n~Z~2\    (|d_1a|z)  , 
1         £=1  l 

from  which  the  lemma  follows.   Q.E.D. 

LEMMA  A. 7.  The  vector   y  computed  by   (4.3)  satisfies 

(L  +  6L)y  -  b 

for  some   lower  triangular  matrix  6L  whose  elements   61..   satisfy 

\6l..      <  min   {n  -  j   +  1,    n  -   l}u>n~^u  I  a  9  '  U:\  )  I  >  i   >   J. 

1      ij1    -  '    ij        jj    '  - 

regardless  of  the  pivoting  strategy. 

PROOF.      We  have   from    (4.3)    that 

(...Uu   *  yt  +  la  *  y2)...+  t  *  T^ltij--  b.,  i  »   2 


where  6 .  is  a  relative  roundoff  error.   By  the  usual  type  of  analysis 

we  £,et  that 

-1  vi-l 


6 


Z±1\    <  ((1  +  a)  u)    -  1)U±1  ,     i  1  2, 

-1  vi-j+1 

+  0)   u)   J 


and 


62-..   <  ((1  +  a)  u)   J    -  l)\i..\y  i  >  j  >  2, 

ij '  -  '  il  '  - 


6fc. . I  <  ult. . 
n1  —   '  li 


The  lemma  follows  from  the  inequalities 

(1  +  of  u)   -  1  <_  u)~  u  k(l  +  u  u) 

.  k-2 
<  koj   u 


and 


£..|  <  min  {u,  (/  j  }  I  a9  )  /a^  *  I  .   Q.E.D, 
ij1  -  iJ   JJ 
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LEMMA  A. 8.  The  matrices   6U  and   6L  of  Lemmas   A. 5  and   A. 7  satisfy 

HlD^LCU  +  6U)|z||  <  (2n+1  -  n  -  3)a)2n^|  |  D_1a|  z  \\ 
for  arbitrary   z  _>  0. 

PROOF.   Using  the  bound  of  Lemma  A. 7  and  the  row  pivoting 
inequality  (4.5),  we  get 

|6£..|    <  min{n  -  j   +  1,    n  -   l}a)   _J  u  I  d  .  /d  .  I  ,  i  >  j. 


Hence 


•>       J-XT     /TT       _L       XTT\     I    _\  ^        ..        T       ^ ,' ^  /  „  -I        -L       1  „  1    \,  ,D       3    /  I  W^ 


(|D      <5L(U  +  6U)|z).    <_  u   E   min{n  -   j    +  1,    n  -   l}co      J(|D      (U  +  6U)  |  z)  . 

j  J 

It    immediately   follows   from  Lemma  A. 5   that 

(|D_1(U  +  6U)|z).    <   a)n":i+1(|D"1u|z). 

and  from  Lemma  A. 4  that 

(|D_1(U  +  6U)|z).    <   03n+j    2j_1]||D"1A|z||. 


n 


_1xt  /«  ^  x„nI^   -  ..2n..   v   _,•_/_    4  j.  i   „    inJ-^ll^-1 


Therefore, 

(|D-J-6L(U  +  6U)|z)  .  £  c/nu  E   min{n  -  j  +  1,  n  -  1}2J_I)|  |d~Xa|z||, 
1_      3=1 

from  which  the  lemma  follows.   Q.E.D. 

THEOREM  4.1.  Let  the  vector   &  be  computed  by  Gaussian  elimination 
with  row  pivoting  and  row  scaling  where   D  =  diag(d  ,  d  ,...,  d  )  is   the  matrix 
of  reciprocal  scale  factors.      Then  there  exists   A A  such  that    (A  +  AA)x  =  b 
with 

II  [D~1AA|  Z  If  <  x(n)^||D~1A|z|| 

for  arbitrary   z  _>  0  and  arbitrary   e  satisfying   0  <  |e|  <  e  where   x(n)  = 

r    0n-2        ,  2nu 
1 19*  2    -  n  -  8  le 

PROOF.   The  restriction  on  e  implies  that  5(II(r»  ))  =  K  (11(D)),  and 

from  Lemmas  A. 3,  A. 6,  and  A. 8  we  have  the  bounds 
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-1 


,n-l 


2n-l 


.-1, 


D;Ae|z||<    (3.2"-1  -    3)^-^110^1  z||, 
D^LfiuHl    <    [5.2n~2   -    21a,2nu|||D-1A|z||, 


and 


||  |D  X6L(U  +  6U)  |z||  <  (2n+1  -  n  -  3)u)2r\J|  |d  "'"A  |  z||. 
The  theorem  follows  from  the  equation 

(A  +  E  +  6LU  +  (L  +  6L)6U)x  =  b.   Q.E.D. 

THEOREM  4.2.  There  exists  a  problem   Ax  =  b  and  a  floating  point 
arithmetic   <+,  -,  x,  J>  suoh  that  the  solution   x  computed  by  Gaussian 
elimination  with  partial  or  complete  pivoting  satisfies   (A  +  AA)x  =  b 
only  for  those  matrices   AA  for  which 


AA 


>  [19-2n  2  -  n  -  81u  +  0(u2) 


Therefore,    the  bound  of  Theorem   4.1  is   the  best  possible  bound  in  the 
limit   u  •*  0. 

PROOF.   Obvious  for  n  =  1.   Assume  n  >  2.   Let 


and 


A  = 


M 

-M  M 


-M  -M 


o  : 


M   1 
-M   1 


1  +  (7.2k  X  -   k  -  8)u, 


(  1  +  (2n+1  -  n  -  7)u, 
1  +  (19.2n_2  -  n  -  8)u, 


k  <_  n  -  2, 
k  =  n  -  1, 
k  =  n. 

If  M  is  large  enough,  then  there  are  no  interchanges  even  with  complete 

pivoting.   We  have 

m. .  =  m  =  M/(-M),     i  >  j  , 

a(1)  -  1, 

in 

(k)  .   (k-1)  ,      (k-1) 

a.    =  a .     -  m  x  a,  _    , 

in     in  k-l,n, 


i  >  k  >  2. 
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Suppose  that  all  these  floating  point  operations  increase  the  magnitude 
of  the  result  by  a  factor  (1  +  2u)/(l  +  u) .   Then 


and 


m  =  -(1  +  u)  +  0(u  ) 


00  -  tt  +  u)  a^'-H  (1  +  3U)  a/":"  +  0(u2). 


in  in 

By  induction  on  k  it  follows  that 


k-l,n 


and  hence 


We  have 


a(k)  =2kl  +  2k(k  _  1)u  +  0/u2j     i  >  k, 

in  — 


u.   =  2k  X  +  2k(k-l)u  +  0(u2) 
kn 


where 


Then 


and 


yl  =  bl 

yk=\"Sk-l'     k^2' 


Sk  =  ( .  .  .  (m  *  y;L  +  m  *  y2)  .  .  .  +  m  *  yfc)  . 


S   =  m  *  b 


S.  =  S.  .  +  m  *    (b.  -  S.  ,)»    2  <  k  <  n  -  1. 
k    k-1         k    k-1        —   — 

Suppose  that  all  these  floating  point  operations  reduce  the  magnitude 
of  the  result  by  a  factor  1/(1  +  u) .   Hence 


S1  =  -b1  +  0(u  ) 


and 


S  =  (2  -  3u)S,  1   -  (1  -  2u)b  +  0(u  ),     2  <_   k  <_  n  -  1. 

By  induction  on  k  it  follows  that 

k    k+1  9 

Sk  =  1  -  2  -  2    (k  -  4)u  -  (k  +  9)u  +  0(u  ),    k  <  n  -  2 


and 


S   ,  =  1  -  2n  1  -  2n  2(4n  -  19)u  -  (n  +  8)u  +  0(u2) 
n— l 
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Also  we  have 


y  =  1  -  2u, 

yk  =  (bk  "  Sk-l)(1  "  U  +  0(u2))'    k  1  2- 


From  this  we  get 


yk  =  2k_1  +  2k(k  -  2)u  +  0(u2),    k  <  n  -  2, 
y  -  =  2n"2  +  2n~2(2n  -  5)u  +  0(u2), 
y  =  2n_1  +  2n_1(2n  -  l)u  +  0(u2). 


We  have 


x  =  y  /u   , 
n    n  nn 

Vl  =  <Vl  *  Vl.n  *  V*' 

Suppose  that  all  these  floating  point  operations  reduce  the  magnitude  of 

the  result  by  a  factor  1/(1  +  u) .   Then 

x  =  (y  /u   )(1  -  u)  =  1  +  0(u2), 
n     n  nn 


n-2  .  „n-2,„    ^s   ,  n,   2, 
n-l,n    n    n-l,n 
2, 


u  ,    *  x_  =  u   n  _(1  -  u)  =  2 "  *  +  2    (2n  -  5)u  +  0(u  ), 


and 


Hence 


x    =  0(u  ), 
n-1 

0  +  u,    *  x  =  u,   (1  -  2u)  =  2k_1  +2k(k  -2)u  +  0(u2),    k  <  n  -  2, 

k,n  n    k,n  — 


x  =  0(u2),    k  <_   n  -  2. 


(b  -  Ax)   =  b  -  1  +  0(u2)  =  (19-2n  2  -  n  -  8)u  +  0(u2). 
n    n 

No  matter  how  we  choose  AA  we  get 

II  I AAl  |x|  II  >  ||AAx||  =  ||b  -  Ax||  =  (19-2n_2  -  n  -  8)u  +  0(u2).   Q.E.D. 
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Appendix  B.      Error  Bounds  for  Column  Pivoting 


For  any  matrix  C  =  (c..)  let  c.  =  c..d..   Also,  let  to  =  1  +  u, 
LEMMA  B.l.  We  have 


and 


-u^li-l^l.     i>k.i>k. 


,-(k),  „  k+1  kv1  ,0  ,k-l-£i-   |    k-1,-   ,      ,  4  v  , 
a..   <u     E   (2w)      a..   +o)    a..,    i.j  >  k  . 

13   -      £=1  '•  i*'         1J 

PROOF.   Equation  (4.1)  implies 

I     00,  i     I  (k)  (k)J  ,  (k)  , 

•mikakj  djl  1  "laik  akj  dj/akk  I'     1  >  k,  J  >  k. 

and  because  of  (A. 5)  we  get 

Im^a^d.l  <  wlafj^d.  I,     i  >  k,  j  >  k, 
1  lk  kj   J '  —   '  lk  k '  J  — 

which  proves  the  first  inequality.   Equation  (4.2)  implies 

l  (k+1)  |    i  (k)  |    ,.        '     N I     (k)  I 

a..     <  wa..   +  (1  +  2u)  m..a.  .   ,     i,j  >  k  +  1, 
1  ij    '  -  »  ij  '  '  lk  kj  " 

and  therefore 

|afk+1)|  <Jafk)|  +  a)(l  +  2u)|afk)|,     i,j  >  k  +  1. 
1  ij    '-  '  ij  '  /]  lk  "        - 

The  second  inequality  of  the  theorem  follows  from  this  by  induction  on  k.   Q.E.D, 

LEMMA  B.2.  The  matrices   L  and   U  satisfy 

LU  =  A  +  E 
with 

|ED|e  <    L7-2n"2  -  n-2J  co2nU|AD|e. 

PROOF.   Assume  n  >   2.   Let  E  =  E^1'  +  E(2^  +. . .+  E^n_1'  where 

(k) 
the  E    are  given  by  Lemma  A. 2.   Substituting  the  bound  on  m.,  of  Lemma  B.l 

(k) 
into  the  bound  on  the  elements  e . .   we  get 

ij 

r   i-(k) |    _   i-(k) I  ,    ,  ,  ^  i 

u|a_  |  +  2a)u|a.   |  for  i,j  >  k, 

for  i  >  j  =  k, 

otherwise. 
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This   implies 

Z|efk)|    <    (2n   -    2k+l)um|ifk)|    +  "        &       |a(k)|,  1  >   k, 

«      ij  '    xk    '  .    ,  ,     '    ii    ' 

J        J  J=k+1        J 

and   from  Lemma  B.l   it   follows   that 

l|e<k)|    <    (2n  -  2k+l).2ku{^  Z^1"*^    +  \ij) 

f  ..  2k-i  k:1  _k-i-£ i -  i     k-i     J  i-  i 

+    (n-k)io  u     E      2  a..      +  oj        u        E        a.. 

£=1  "  j-k+1     ^ 

r(2n-l)w2u  E  la.  .  I  if  k  =   1, 

<  <  i     1J 

L(3n  -   3k+l)o)     u   2  E  |  a.  .  |         if  k  >    2. 

j   1J 
The  lemma  follows  because 

n-1  _ 

(2n-l)  +  Z   (3n  -  3k+l)2    =  7-2n   -  n-2.   Q.E.D. 
k=2 

LEMMA  B.3.  We  have 

\l .  .   u..|  <  ulaH  I  i 
1  ij   ij  -    ij 

IJ  A=l  1J6 

and 


u .  .   <  u. .  I  >    i<  i. 
i  iji  -  i  ii"      - 

PROOF.   The  first  two  inequalities  follow  immediately  from 
Lemma  B.l,  and  the  third  inequality  is  a  consequence  of  column  pivoting.   Q.E.D, 
LEMMA  B.4.  The  matrix   SU  of  Lemma   A. 5  satisfies 
|L  6U  D|e  <  (2n+1  -  n-2)a)2n  u|AD|e. 
PROOF.   Applying  the  inequality  |u..|  <_  |u..|  of  Lemma  B.3  and 


the  bounds  of  Lemma  A. 5,  we  get  for  i  <  n 
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n 
Z  1 6u .  .  I  <  u  Z   g(i,i)oj    lu.J 


Therefore 


(n-i+2) (n-i+1)   n-i  .- 
<  ~ 0)    u  u.  . 

—       2  '  li 


(  L5UD  e)  .  =  Z  £.  .   S  6u._ 
3      k 


<u£  <°-J+2)(n-J+l)  Mn-J|,,,  S.J. 
"   j       2  1]  " 

and  so  using  Lemma  B.3  gives 

,|T»m|ax   ,   2n   °   (n-j+2)(n-j+l)   J   r^j-l-^-.  .-   i 

(|L6UD|e).  <_  to  u   E  ■■ „ — * Z   |2      ||a   | 

1  "       j=l        Z       £--1  lJl 

<w2nu  J   (n-J+2)(n-J+l)  rf-2}      J   |j   , 
j=l  £=1    l£ 

from  which  the  lemma  follows.   Q.E.D. 

LEMMA  B.5.  The  matrices   SU  and  6L  of  Lemmas   A.  5  anc?  A.  7  satisfy 

|6L(U  +  6U)D|e  £  3  (2n-n-l)a)2nu|  AD  |  e. 

PROOF.   Applying  the  inequality   u..   <   u..   of  Lemma  B.3  and 

1    ij '  —  '    ii1 

the  bounds   of  Lemma  A. 5,   we   get 

^\~  I  r,  n-l+ll-  I 

Z   u.  .    +  6u.  .      <      Z      u)  u.  . 

.     ij  ii '        •■, ■■   .  ii 

3  J  3=i 

<    (n-i+l)u,n-i+1|u..|. 

Therefore  applying  the  bounds  of  Lemma  A. 7  gives 

(|6L(U  +  6U)D|e).  =  Z  I  6£  .  .  I  Z  I  G  #1  +  6u..  I 
i   j    13  k  3k     3k' 

<  Z  min  {n-j+1,  a-l}u>n""j  laf^/af^  I  (a-j+l)a>n'j+1|uJ1  A  , 

-  j  '  13    33  '  {    33* 

and  so  using  Lemma  B.3  gives 


56 


Appendix  C.      Error  Bounds  for  Iterative  Improvement 


LEMMA  C.l.  The  computed  residual  satisfies 


r     =  b  -  Ax     +  c 
m  mm 


with 


where 


c        <ub-Ax+wAx 
m'    —  m'  '    m 


w  = 


"(I  +  u)     [(1  +  ft)"   -   1] 
il   +  u)     [(1   +  u2)n   -    1] 


for  s.p.   aooum. 
for  d.p.   aooum. 


and   u  =  u/(l  +  u) . 

PROOF.   The  computed  residual 

r  =  b  -  (Ax  +  c')  +  c" 
m         m    m     m 

where  c'  is  the  error  due  to  computing  Ax  and  c"  is  the  rest  of  the  error, 
m  mm 

For  the  single  precision  case  we  have 
|cj  1  [(1  +  u)n  -  1]  |  A  |  |  x 
and 


m' 


m' 


so  that 


c"   <  u  b  -  Ax  -  c' 
m  —         m    m 


c   <   c'   +  c" 
m'  —  '  m1    '  m1 


<  ulb  -  Ax   +  (1  +  u)  [(1  +  u)n  -  1]  IaIIx  I. 
—         m  m 


For  the  double  precision  case  we  have 


c'|  <  [(l  +  u2)n  -  1]  |A||x J 

m  —  m 


and 


so  that 


|cn|  <  ulb  -  Ax  -  c' 
1  m1  —   '       m    it 


c    <   c1   +  c" 
m  —   m1      m 


2,n 


<  ulb  -  Ax  I  +  (1  +  u)  [(1  +  u  )   -  1]   Ax   .   Q.E.D, 
—   '      m1  m 
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2     n  J     -_1  o 

(|SL(U  +  6U)D|e).  <  a)  n  u  E   min  {n-j+1,  n-1}  (n-j+1)  E  \23        l\  a   | 

1        j=l  1=1  Xl 

2  n  •  o   n 

£co   u  E  min  {n-j+1,  n-1}  (n-j+1)  [2J   1   E  |a   |, 

j=l  1=1      l£ 

from  which  the  theorem  follows.   Q.E.D. 

THEOREM  5.1.  Let  the  veotor   &  be  computed  by  Gaussian  elimination 
with  column  pivoting  and  column  scaling  where   D  =  diag(d  ,  d  , . . . ,  d  ) 
is   the  matrix  of  scale  factors.      Then  there  exists  A  A  such  that 

I  (A  +A  A)x  =  b 

with 
|AAD|e  <_   x(n)u|AJ)|e 
where 


-  t    \         i  o-7  on-2    c  -i  i  2nu 

X(n)  =  L27- 2    -  5n  -  7Je 

PROOF.   The  theorem  follows  from  the  bounds  of  Lemmas  B.2,  B.4, 

and  B.5  and  from  the  equality 

(A  +  E  +  6LU  +  (L  +  6L)6U)x  =  b.   Q.E.D. 
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LEMMA  C.5.  We  have 


uv  +  w 


T^     I  UV  T  W  II  |  ,  I 

lim   z    <  — A   x 

"  m"  —  1  -  t  "  '  "  ' 


provided  that   t  <  1  where 

t  =  u  +  (2v  +  uv  +  w)y. 
PROOF.   From  Lemma  C.4 


\U   +A\<¥Z-2£LW*    ll  +  f^^^lllAllxHI.      Q.E.D. 
' '    m+1 "  —  1   -   vy  "    m "        1-vy 

LEMMA  C.6.      We  have 


w       rl  a  I  I      I  II  i  .  I  I      I  II    i  uv   +   w 


T~- —    I  w       r|  .  I  ||  I  .  I         llli,     UV   T   W 

lim     z        < {A     x     -        A     x     e}   +  — 

1    m1    —  1  -  u i     ii     in  1-t 


A     x      e 


m-x» 
provided  that  t  <   1. 

PROOF.      From  Lemma   C.4 


lim    I  z        <   u  lim    |z    I    +w{|a||x|    -|||A||x||e} 

1  m  I   —  I  m  I 


m'  —       '  m 
m-H»  m>o° 

+  ((f^1-  u)  limjz  ||  +  ^tw|||A||x|||} 
1  -  vy         "  m"   i  -  vy        ' 
m-x» 

and  the  lemma  follows  from  Lemma  C.5.   Q.E.D. 

THEOREM  C.7.  Assume    |a||x|  >  0.  Then 

(1  +  uy) (uv  +  w)o  w(g  -  1) 
1-t      "l-u 


lim  n  <  — 

m  —    .,       (1  +  u)  (uv  +  w)ya 

ni-xx>         l  -  u  -  z ' — 

1-t 

provided  that    (1  +  u) (uv  +  w)ya  <  (1  -  u) (1  -  t)  where   a  =  oR(A,  x) 

PROOF.   From  Lemmas  C.5  and  C.6  we  have 

T~r~  ii   ii    (uv  +  w)a  I  .  i  i  I 

lim  z   e  <  •i— z A  x 

"  m"  —   1-t   'iii 
m-x» 


and 


-TT-   I   I    r  (uv  +  w)a   w(a  -  l)  -.  I  .  I  I  i 

lim  z  <  {-z—z »=■ -}  Ax. 

1  m'  —    1-T       l-u   '!'■ 
m+oo 


Hence  using  Lemma  C.3, 
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LEMMA  C.2.      Define   z     =  A(x     +  d     -  x) .      Then 

m  mm 

I  z        <u|A(x     -  x)     +  w  A     x 
i    mi   _     i        m  i  i     i  i    mi 

+    (1   -  vy)_1v    (||  |A|  |xm  -   x|  ||  +  yu||A(xm  -   x)||  +  Yw||  |A||xJII>e 
where  v  =  x(n)u  °-nd  y  =  Cond(A     ). 

PROOF.      The  correction   term  dm  =    (A  +  F   )~   r     where  F      is   the  AA 


m 


mm  m 


of   Theorem  4.1.      We  have 


z      =  A(x     -x)+r      -    (I  +  F  A      )         FA~r 
mm  m  m  mm 

=   c      -    (I  +  F  A-1)"1   F    (x     -   x  -  A_1    c   ). 
m  m  m     m  m 

It  follows  from  the  bounds  of  Theorem  4.1  that 

-1 


z   <  |c  I  +  (1  -  vy)  v  (||  |a|  |x  -  x|  II  +yI|c  ||)e. 

m  —  '  m'  "  '  '  '  m  '  "    "  m" 

Substituting  the  bounds  of  Lemma  C.l  into  this  proves  the  lemma.   Q.E.D. 
LEMMA  C.3.  We  have 


and 


A(x  , ,  -x)   <   z   +  ul A| | x  +  uy  z   e 
m+1      '  —  '  m'     i  ii  i      "m" 


A  x  .,  -  x  <  u  A  x  +  (1  +  u)y  z   e. 
i  i  i  jn-i-i     i  _   i  i  i  i  "  m" 

PROOF.   The  new  iterate 


x    =  x  +  d  +  g 
m+1    m    m    m 


where  |g   <_   u|x  +  d  |.   Equivalently 


m"  —        m 


x    =  x  +  A  z  +g 
m+1  m    m 


where  |g   <u|x|  +u|A  z  I,  from  which  the  lemma  follows.   Q.E.D, 
m  m 

LEMMA  C.4.  We  have 


z 


<   u(|z   -  ||z  ||e)  +  w(|A|  |x|  -  I!  |A|  |x|  ||e) 


m+1 '   —        '    m '        ' '    m 

+    (1   -  vy)~      {  [u  +  y(v  +  uv  +  w)]  ||z    II  +    (uv  +  w)||  |A|  |x|  ||}e 

2 
where  w  =  u     +  w(l  +  u) . 

PROOF.   Substituting  the  bounds  of  Lemma  C.3  into  those  of 

Lemma  C.2  proves  the  lemma.   Q.E.D. 
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PROOF.   Substituting  the  inequality  of  Lemma  C.8  into  those 
of  Lemma  C.3  gives 

(v  +  u)wy   v(u  +  vy  +  w) (1  +  uy) 


|a(x2  -  x) |  <   (w  +  u)  |a|  |x|  +  i-^-z 

and 

(1  •  vy) 

and   the    theorem   follows    from   the   inequality 

|A(x2   -   x| 

n      <  max  -r— r-i — i rrn T     •      Q-E.D, 

2  —  A     x     -      A     x„   -   x 


VY  (1  -  vy)2 


.ii  i  i  a  i  i    i         (w  +  uv  +  vy),.,  x      in  .  ii     i  ii 

A|  |x9   ■  ■_  u|A|  |x|    +  ■  — y2-   (1  +  u)y  |[  |A|  |x|  ||e, 


Axe 
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m-**> 
and 


tt—   i  .  /  n  i         r  (1  +  uy)  (uv  +  wr)a        w(a  -   1)    .    •  ■,    i  .  i  i     i 

lim     A(x     -   x)      <    {- "-* ^ +  u}      Ax 

1        m  '   —  1-t  1-u  i    i  i    i 


<    {u+   (1  +  u)(uv+w)ya}    |A| 

—  1-T  '      ' 


lxm     Ax     -x|    <{u+  ; L— }    |A|  |x|  . 

1     '  '    m 
nv-*» 


The   theorem  follows   from 


and   thus 


So 


|A(x     -   x)|  |A(x      -   x)| 

m  '  m 

H      =  max      I  .  I  I 1 <    max    i-tt-i — i r-rT-i T   •       Q.E.D. 

m  Ax  —  Ax      -      Ax      -   x 

i     i  i    mi  i     i  i     i  i     i  i    m  i 

LEMMA  C.8.     Vie  have 

l       l  l  a  l  l     l     .     r      wvY       ,    v(u  +  vy+wy)-,|||.||     in 

z.      <  wAx     +  {- —  +  — ^ ' ^r-^}        Axe. 

1    l1    —      '     '  '     '  1   -  vy  /n  N2  ' 

(1  -  vy) 

PROOF.      The   first    iterate   x     =    (A  +  AA)~   Ax, 


x     -   x  =   -A  1(I  +  AA  A   1)    "'"AAx. 


and 


A(x1  -  x)||  <    (1  -  vy)    \  ||  |A|  |x| 


a|  J  x1    -   x|  ||  <_   (1   -  vy)      vy  |    |a|  |x| 


From  Lemma   C.2  we  have 


z,  |  £  w|  A  |  |  x| 


-1 


+  (1  -  vy)   {(v  +  w)||  |A|  |X]_  -  x|  II  +  u||A(Xl  -  x)||  +  vwy||  |a|  |x|  ||}. 
The  lemma  follows  by  substituting  the  previous  two  inequalities  into  this.   Q.E.D. 
THEOREM  C.9.  Assume   that    \k\  |x|  >  0.  Then 

(v  +  u)wya   (u  +  vy  +  wy)  (1  +  uy)vo~ 

W  +  U  +   :;           +                   ^ 
n   <  (1  "  VT> 

2  —  2 

n       (w  +  uv  +  v  y)  ,1    ,. 
1-u -Z — —  (1  +  u)ya 

(1  -  vy)Z 

provided  that  the  denominator  is  positive. 
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